Abstract

Image variability due to changes in pose and illumina- 
tion can seriously impair object recognition. This paper 
presents appearance-based methods which, unlike previ- 
ous appearance-based approaches, require only a small 
set of training images to generate a rich representation 
that models this variability. Specifically, from as few 
as three images of an object in fixed pose seen under 
slightly varying but unknown lighting, a surface and an 
albedo map are reconstructed. These are then used to 
generate synthetic images with large variations in pose 
and illumination and thus build a representation useful 
for object recognition. Our methods have been tested 
within the domain of face recognition on a subset of 
the Yale Face Database B containing 4050 images of 10 
faces seen under variable pose and illumination. This 
database was specifically gathered for testing these gen- 
erative methods. Their performance is shown to exceed 
that of popular existing methods. 

1 Introduction 

An object can appear strikingly different due to 
changes in pose and illumination (see Figure 1). To 
handle this image variability, object recognition systems
usually use one of the following approaches: (a)
control viewing conditions, (b) employ a representation 
that is invariant to the viewing conditions, or (c) di- 
rectly model this variability. For example, there is a 
long tradition of performing edge detection at an early 
stage since the presence of an edge at an image location 
is thought to be largely independent of lighting. It has 
been observed, however, that methods for face recog- 
nition based on finding local image features and using 
their geometric relation are generally ineffective [4]. 

Here, we consider issues in modeling the effects of 
both pose and illumination variability rather than try- 
ing to achieve invariance to these viewing conditions. 
We show how these models can be exploited for re- 
constructing the 3-D geometry of objects and used to 
significantly increase the performance of appearance-
based recognition systems. We demonstrate the use
of these models within the context of face recognition, 
but believe that they have much broader applicability. 

Methods have recently been introduced which use 
low-dimensional representations of images of objects
to perform recognition; see, for example, [8, 13, 19].
These methods, often termed appearance-based methods,
differ from feature-based methods in that their
low-dimensional representation is, in a least-squares 
sense, faithful to the original image. Systems such as 
SLAM [13] and Eigenfaces [19] have demonstrated the 
power of appearance-based methods both in ease of im- 
plementation and in accuracy. 

Yet these methods suffer from an important drawback:
recognition of an object under a particular pose
and lighting can be performed reliably provided the 
object has been previously seen under similar circum- 
stances. In other words, these methods in their original 
form have no way of extrapolating to novel viewing con- 
ditions. Here, we consider the construction of a gener- 
ative appearance model and demonstrate its usefulness 
for image-based rendering and recognition. 

The presented approach is, in spirit, an appearance-based
method for recognizing objects under large variations
in pose and illumination. However, it differs substantially
from previous methods in that it uses as few
as three images of each object seen in fixed pose and 
under small but unknown changes in lighting. From 
these images, it generates a rich representation that 
models the object's image variability due to pose and 
illumination. One might think that pose variation is 
harder to handle because of occlusion or appearance of 
surface points and the non-linear warping of the image 
coordinates. Yet, as demonstrated by favorable recognition
results, our approach can successfully generalize
the concept of the illumination cone which models all 
the images of a Lambertian object in fixed pose under 
all variation in illumination [1]. 

New recognition algorithms based on these genera- 
tive models have been tested on a subset of the Yale 
Face Database B (see Figure 1) which was specifically 
gathered for this purpose. This subset contained 4050 
images of 10 faces each seen under 45 illumination con- 
ditions over nine poses. As we will see, these new al- 
gorithms outperform popular existing techniques. 

Figure 1: Example images from the Yale Face Database
B, showing the variability due to pose and illumination
in the images of a single individual. a. An image from
each of the nine different poses; b. A representative
image from each illumination subset: Subset 1 (12°),
Subset 2 (25°), Subset 3 (50°), Subset 4 (77°).

2 Modeling Illumination and Pose 

2.1 The Illumination Cone 

In earlier work, it was shown that for a convex ob- 
ject with Lambertian reflectance, the set of all n-pixel 
images under an arbitrary combination of point light 
sources forms a convex polyhedral cone in the image 
space $\mathbb{R}^n$. This cone can be built from as few as three
images [1]. Here, we outline the relevant results. 

Let $x \in \mathbb{R}^n$ denote an image with $n$ pixels of a
convex object with a Lambertian reflectance function
illuminated by a single point source at infinity. Let
$B \in \mathbb{R}^{n \times 3}$ be a matrix where each row of $B$ is the
product of the albedo with the inward pointing unit
normal for a point on the surface projecting to a particular
pixel in the image. A point light source at infinity
can be represented by $s \in \mathbb{R}^3$, signifying the product
of the light source intensity with a unit vector in the
direction of the light source. A convex Lambertian surface
with normals and albedo given by $B$, illuminated
by $s$, produces an image $x$ given by



$$x = \max(Bs, 0), \tag{1}$$



where $\max(Bs, 0)$ sets to zero all negative components
of the vector $Bs$. The pixels set to zero correspond to
the surface points lying in an attached shadow. Convexity
of the object's shape is assumed at this point
to avoid cast shadows. Note that when no part of the
surface is shadowed, $x$ lies in the 3-D subspace $\mathcal{L}$ given
by the span of the columns of $B$ [8, 14, 16].

If an object is illuminated by k light sources at in- 
finity, then the image is given by the superposition of 
the images which would have been produced by the 
individual light sources, i.e., 



$$x = \sum_{i=1}^{k} \max(B s_i, 0), \tag{2}$$



where $s_i$ is a single light source. Due to this superposition,
it follows that the set of all possible images
$\mathcal{C}$ of a convex Lambertian surface created by varying
the direction and strength of an arbitrary number of
point light sources at infinity is a convex cone. It is
also evident from Equation 2 that this convex cone is
completely described by matrix $B$.
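
The image formation model of Equations 1 and 2 can be illustrated with a minimal NumPy sketch. The function name `render` and the four-pixel matrix `B` are hypothetical and chosen only for illustration; they do not come from the experiments reported here.

```python
import numpy as np

def render(B, S):
    """Image of a convex Lambertian surface under the light sources in S.

    B: (n, 3) array, each row = albedo * inward unit normal (as in the text).
    S: (3,) or (3, k) array, each column = intensity * light direction.
    Returns an (n,) image: the sum over sources of max(B s_i, 0).
    """
    S = np.asarray(S, dtype=float).reshape(3, -1)
    return np.maximum(B @ S, 0.0).sum(axis=1)

# Toy surface with four pixels (hypothetical values, for illustration only).
B = np.array([[0.0, 0.0, 1.0],
              [0.6, 0.0, 0.8],
              [-0.6, 0.0, 0.8],
              [0.0, 0.9, 0.1]])
s1 = np.array([0.0, 0.0, 1.0])
s2 = np.array([1.0, 0.0, 0.2])

x1, x2 = render(B, s1), render(B, s2)          # single-source images (Equation 1)
x12 = render(B, np.column_stack([s1, s2]))     # two sources at once (Equation 2)
```

In this toy example the third pixel faces away from `s2` and is therefore set to zero (an attached shadow), and the two-source image equals the sum of the two single-source images, which is the superposition property the cone argument relies on.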

Furthermore, any image in the illumination cone $\mathcal{C}$
(including the boundary) can be determined as a convex
combination of extreme rays (images) given by

$$x_{ij} = \max(B s_{ij}, 0), \tag{3}$$

where

$$s_{ij} = b_i \times b_j. \tag{4}$$

The vectors $b_i$ and $b_j$ are the rows of $B$ with $i \neq j$. It
is clear that there are at most $m(m-1)$ extreme rays
for $m \le n$ independent surface normals.
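
As a sketch of Equations 3 and 4, the following enumerates one candidate extreme ray for every ordered pair of distinct rows of $B$. The function name `extreme_rays` and the toy matrix are hypothetical; the code assumes the rows of $B$ are distinct and makes no attempt to remove duplicate or degenerate rays.

```python
import numpy as np
from itertools import permutations

def extreme_rays(B):
    """Return an (n, m*(m-1)) array of candidate extreme rays x_ij.

    For every ordered pair of distinct rows (b_i, b_j) of B, the light source
    s_ij = b_i x b_j is formed (Equation 4) and rendered with Equation 3.
    Assumes the rows of B are distinct normals; no de-duplication is done.
    """
    rays = []
    for i, j in permutations(range(B.shape[0]), 2):
        s_ij = np.cross(B[i], B[j])
        rays.append(np.maximum(B @ s_ij, 0.0))
    return np.column_stack(rays)

# Toy surface with four pixels (hypothetical values, for illustration only).
B = np.array([[0.0, 0.0, 1.0],
              [0.6, 0.0, 0.8],
              [-0.6, 0.0, 0.8],
              [0.0, 0.9, 0.1]])
R = extreme_rays(B)   # 4 normals give 4 * 3 = 12 columns
```

Because $s_{ij}$ is orthogonal to both $b_i$ and $b_j$, pixels $i$ and $j$ are exactly zero in $x_{ij}$: each such light source grazes two surface points at once.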

2.2 Constructing the Illumination Cone 

Equations 3 and 4 suggest a way to construct the illumination
cone for each object: gather three or more
images in fixed pose under differing but unknown illumination
without shadowing and use these images to
estimate a basis for the 3-D illumination subspace $\mathcal{L}$.
One way to do this is to normalize the images to be
of unit length, and then use singular value decomposition
(SVD) to calculate, in a least-squares sense, the
best 3-D orthogonal basis in the form of a matrix $B^*$.
Note that even if the columns of $B^*$ exactly span the
subspace $\mathcal{L}$, they differ from those of $B$ by an unknown
linear transformation, i.e., $B = B^* A$ where $A \in GL(3)$;
for any light source, $x = Bs = (B^* A)(A^{-1} s)$ [10].
Nonetheless, both $B^*$ and $B$ define the same illumination
cone $\mathcal{C}$ and represent valid illumination models.
From $B^*$, the extreme rays defining the illumination
cone $\mathcal{C}$ can be computed using Equations 3 and 4.
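
The SVD-based estimate and the linear ambiguity can be checked numerically on synthetic data. This is a sketch under strong assumptions: the data are generated so that no pixel is ever shadowed, and all names and values (`n`, `k`, the sampling ranges) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic shadow-free data (hypothetical): normals and lights close to +z,
# so every entry of B @ S is strictly positive and max(., 0) never clips.
n, k = 50, 5
N = np.column_stack([rng.uniform(-0.3, 0.3, (n, 2)), np.ones(n)])
B = rng.uniform(0.5, 1.0, (n, 1)) * N / np.linalg.norm(N, axis=1, keepdims=True)
S = np.vstack([rng.uniform(-0.3, 0.3, (2, k)), np.ones((1, k))])
X = B @ S

# Normalize each image to unit length, then keep the three leading left
# singular vectors as an orthonormal basis B* of the illumination subspace.
Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
U, sv, Vt = np.linalg.svd(Xn, full_matrices=False)
B_star = U[:, :3]

# B and B* span the same subspace, so B = B* A for some invertible 3x3 A.
A = B_star.T @ B
```

Here `A` is recovered only because the true `B` is known in the simulation; with real images, `A` is exactly the unknown that the remainder of this section sets out to constrain.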

Unfortunately, using SVD in the above procedure
leads to an inaccurate estimate of $B^*$. Even for a convex
object whose occluding contour is visible, there is
only one light source direction (the viewing direction) 
for which no point on the surface is in shadow. For any 
other light source direction, shadows will be present. If 
the object is non-convex, such as a face, then shadow- 
ing in the modeling images is likely to be more pro- 
nounced. When SVD is used to find B* from images 
with shadows, these systematic errors bias its estimate 
significantly. Therefore, an alternative way is needed 
to find B* that takes into account the fact that some 
data values are invalid and should not be used in the 
estimation. For the purpose of this estimation, any 
invalid data can be treated as missing measurements. 

The technique we use here is a combination of two
algorithms. A variation of [17] (see also [11, 18]), which
finds a basis for the 3-D linear subspace $\mathcal{L}$ from image
data with missing elements, is used together with the
method in [6], which enforces integrability in shape from
shading. We have modified the latter method to guarantee
integrability in the estimates of the basis vectors
of subspace $\mathcal{L}$ from multiple images. By enforcing integrability,
a surface context is introduced. Namely, the
vector field induced by the basis vectors is guaranteed
to be a gradient field that corresponds to a surface.

Furthermore, enforcing integrability inherently leads
to more accurate estimates because there are fewer parameters
(or degrees of freedom) to determine. It also
resolves six out of the nine parameters of $A \in GL(3)$.
The other three correspond to the generalized bas-relief
(GBR) transformation parameters, which cannot be resolved
with illumination information alone (i.e., shading
and shadows) [2]. This means we cannot recover the
true matrix $B$ and its corresponding surface $z(x, y)$.
We can only find GBR-transformed versions of $B$ and $z(x, y)$.

Our estimation algorithm is iterative and to enforce 
integrability, the possibly non-integrable vector field in- 
duced by the current estimate of B* is, in each itera- 
tion, projected down to the space of integrable vector 
fields, or gradient fields [6]. To begin, let us expand 
the surface z(x,y) using basis surfaces (functions): 

$$z(x, y; c(w)) = \sum_{w} c(w)\, \phi(x, y; w), \tag{5}$$

where $w = (w_x, w_y)$ is a two-dimensional index, and
$\{\phi(x, y; w)\}$ is a finite set of basis functions which are
not necessarily orthogonal. We chose the discrete cosine
basis so that $\{c(w)\}$ is exactly the set of the 2-D
discrete cosine transform (DCT) coefficients of $z(x, y)$.

Note that the partial derivatives of $z(x, y)$ can also
be expressed in terms of this expansion, giving

$$z_x(x, y; c(w)) = \sum_{w} c(w)\, \phi_x(x, y; w) \tag{6}$$

and

$$z_y(x, y; c(w)) = \sum_{w} c(w)\, \phi_y(x, y; w). \tag{7}$$



Since the partial derivatives of the basis functions,
$\phi_x(x, y; w)$ and $\phi_y(x, y; w)$, are integrable and the expansions
of $z_x(x, y)$ and $z_y(x, y)$ share the same coefficients
$c(w)$, it is easy to see that $z_{xy}(x, y) = z_{yx}(x, y)$.

Suppose now that we have the possibly non-integrable
estimate $B^*$, from which we can easily deduce the
possibly non-integrable partial derivatives $z_x^*(x, y)$ and
$z_y^*(x, y)$. These can also be expressed as a series, giving

$$z_x^*(x, y; c_1^*(w)) = \sum_{w} c_1^*(w)\, \phi_x(x, y; w) \tag{8}$$

and

$$z_y^*(x, y; c_2^*(w)) = \sum_{w} c_2^*(w)\, \phi_y(x, y; w). \tag{9}$$

Note that in general $c_1^*(w) \neq c_2^*(w)$, which implies that
$z_{xy}^*(x, y) \neq z_{yx}^*(x, y)$.

Let us assume that $z_x^*(x, y)$ and $z_y^*(x, y)$ are known
from an estimate of $B^*$ and we would like to find
$z_x(x, y)$ and $z_y(x, y)$ (a set of integrable partial derivatives)
which are as close as possible to $z_x^*(x, y)$ and
$z_y^*(x, y)$, respectively, in a least-squares sense. The goal
is to minimize the following:

$$\min_{c} \sum_{x, y} \left( z_x(x, y; c) - z_x^*(x, y; c_1^*) \right)^2 + \left( z_y(x, y; c) - z_y^*(x, y; c_2^*) \right)^2. \tag{10}$$

In other words, take a set of possibly non-integrable
partial derivatives, $z_x^*(x, y)$ and $z_y^*(x, y)$, and "enforce"
integrability by finding the least-squares fit of integrable
partial derivatives $z_x(x, y)$ and $z_y(x, y)$. Notice
that to get the GBR-transformed surface $z(x, y)$ we
need only perform the inverse 2-D DCT on the coefficients
$c(w)$.
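
The least-squares projection of Equation 10 can be sketched with an explicit cosine basis and a generic solver. This is an illustrative stand-in rather than the implementation used here: the basis is truncated to `K x K` terms, its derivatives are taken analytically, the constant term is dropped because it does not affect the gradient, and all function names are hypothetical.

```python
import numpy as np

def cosine_basis_gradients(H, W, K):
    """Stacked x- and y-derivatives of a K x K cosine basis on an H x W grid.

    phi(x, y; w) = cos(pi*wx*(x+0.5)/W) * cos(pi*wy*(y+0.5)/H), w != (0, 0).
    Returns (Phi, Gx, Gy), each of shape (H*W, K*K - 1).
    """
    y, x = np.mgrid[0:H, 0:W]
    ax, ay = np.pi * (x + 0.5) / W, np.pi * (y + 0.5) / H
    Phi, Gx, Gy = [], [], []
    for wx in range(K):
        for wy in range(K):
            if wx == 0 and wy == 0:
                continue  # constant term: invisible to the gradient
            Phi.append((np.cos(wx * ax) * np.cos(wy * ay)).ravel())
            Gx.append((-wx * np.pi / W * np.sin(wx * ax) * np.cos(wy * ay)).ravel())
            Gy.append((-wy * np.pi / H * np.cos(wx * ax) * np.sin(wy * ay)).ravel())
    return np.column_stack(Phi), np.column_stack(Gx), np.column_stack(Gy)

def enforce_integrability(zx_star, zy_star, K):
    """Least-squares fit of Equation 10: returns (c, z, zx, zy)."""
    H, W = zx_star.shape
    Phi, Gx, Gy = cosine_basis_gradients(H, W, K)
    A = np.vstack([Gx, Gy])                      # both derivatives share c
    b = np.concatenate([zx_star.ravel(), zy_star.ravel()])
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    shape = (H, W)
    return c, (Phi @ c).reshape(shape), (Gx @ c).reshape(shape), (Gy @ c).reshape(shape)

# Toy check: perturb an integrable gradient field, then project it back.
rng = np.random.default_rng(1)
H = W = 8
K = 4
Phi, Gx, Gy = cosine_basis_gradients(H, W, K)
c_true = rng.normal(size=K * K - 1)
zx_true, zy_true = (Gx @ c_true).reshape(H, W), (Gy @ c_true).reshape(H, W)
zx_noisy = zx_true + 0.01 * rng.normal(size=(H, W))
zy_noisy = zy_true + 0.01 * rng.normal(size=(H, W))
c, z, zx, zy = enforce_integrability(zx_noisy, zy_noisy, K)
```

Stacking the two derivative systems and solving for a single coefficient vector is what forces $z_x$ and $z_y$ to share $c(w)$; the surface `z` then follows from the same coefficients, up to an additive constant.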

The above procedure is incorporated into the following
algorithm. To begin, define the data matrix for $k$
images of an individual to be $X = [x_1, \ldots, x_k]$. If there
were no shadowing, $X$ would be rank 3 [15] (assuming
no image noise), and we could use SVD to factorize $X$
into $X = B^* S$, where $S$ is a $3 \times k$ matrix whose columns
$s_i$ are the light source directions scaled by their corresponding
source intensities for all $k$ images.

Since the images have shadows (both cast and at- 
tached), and possibly saturations, we first have to de- 
termine which data values do not satisfy the Lamber- 
tian assumption. Unlike saturations, which can be sim- 
ply determined, finding shadows is more involved. In 
our implementation, a pixel is assigned to be in shadow 
if its value divided by its corresponding albedo is be- 
low a threshold. As an initial estimate of the albedo we 
use the average of the modeling (or training) images. 
A conservative threshold is then chosen to determine 
shadows making it almost certain no invalid data is in- 
cluded in the estimation process, at the small expense 
of throwing away a few valid measurements. After find- 
ing the invalid data, the following estimation method 
is used: 

1. Use the average of the modeling (or training) images
as an initial estimate of the albedo.

2. Without doing any row or column permutations,
sift out all the full rows (with no invalid data) of
matrix $X$ to form a full sub-matrix $\tilde{X}$.

3. Perform SVD on $\tilde{X}$ to get an initial estimate of $S$.

4. Fix $S$ and the albedo, and estimate a possibly
non-integrable set $z_x^*(x, y)$ and $z_y^*(x, y)$ using
least-squares.

5. By minimizing the cost functional in Equation 10,
estimate (as functions of $c(w)$) a set of integrable
partial derivatives $z_x(x, y)$ and $z_y(x, y)$.

6. Fix $S$ and use $z_x(x, y)$ and $z_y(x, y)$ to update the
albedo using least-squares.

7. Use the newly calculated albedo and the partial
derivatives $z_x(x, y)$ and $z_y(x, y)$ to construct $B$.

8. Then, fix $B$ and update each of the light source
directions $s_i$ independently using least-squares.

9. Repeat steps 4-8 until the estimates converge.

10. Perform inverse DCT on the coefficients $c(w)$ to
get the GBR surface $z(x, y)$.

In our experiments, the algorithm is well behaved, provided
the input data is well conditioned, and converges
within 10-15 iterations.

Figure 2: The process of constructing the cone $\mathcal{C}$.
a. The training images; b. Images corresponding to
columns of $B$; c. Reconstruction up to a GBR transformation;
d. Sample images from the illumination cone
under novel lighting conditions in fixed pose.
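
The shadow test and steps 1-3 of the algorithm can be sketched as follows. The function name and the values of `shadow_thresh` and `saturation` are hypothetical placeholders; the actual thresholds are not specified in the text. Steps 4-10 are not shown.

```python
import numpy as np

def initial_light_estimate(X, shadow_thresh=0.3, saturation=1.0):
    """Steps 1-3: albedo guess, full-row selection, SVD estimate of S.

    X: (n, k) data matrix, one image per column, intensities in [0, saturation].
    shadow_thresh and saturation are hypothetical values, not from the paper.
    Returns (S, valid): S is (3, k); valid is an (n, k) boolean mask.
    """
    albedo = X.mean(axis=1)                                   # step 1
    ratio = X / np.maximum(albedo, 1e-12)[:, None]
    valid = (ratio >= shadow_thresh) & (X < saturation)       # shadowed or saturated -> invalid
    full = valid.all(axis=1)                                  # step 2 (row order preserved)
    U, sv, Vt = np.linalg.svd(X[full], full_matrices=False)   # step 3
    S = sv[:3, None] * Vt[:3]
    return S, valid
```

The returned `S` is determined only up to an invertible $3 \times 3$ transformation, which is consistent with the ambiguity discussed above; the `valid` mask marks the entries that later least-squares steps would treat as missing measurements.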

Figure 2 demonstrates the process for constructing
the illumination cone. Figure 2.a shows six of the 19
single light source images of a face used in the estimation
of matrix $B$. Note that the light source in each
image moves only by a small amount (±15° in either
direction) about the viewing axis. Despite this, the images
do exhibit some shadowing, e.g., left and right of
the nose. Figure 2.b shows the basis images of the estimated
matrix $B$. These basis images encode not only
the albedo (reflectance) of the face but also its surface
normal field. They can be used to construct images
of the face under arbitrary and quite extreme illumination
conditions. Figure 2.c shows the reconstructed
surface of the face $z(x, y)$ up to a GBR transformation.
The first basis image of matrix $B$ shown in Figure 2.b
has been texture-mapped on the surface.

Figure 2.d shows images of the face generated using 
the image formation model in Equation 1 which has 
been extended to account for cast shadows. To deter- 
mine cast shadows, we employ ray-tracing that uses the 
reconstructed GBR surface of the face $z(x, y)$. With
this extended image formation model, the generated
images exhibit realistic shading and, unlike the images
in Figure 2.a, have strong attached and cast shadows.

2.3 Image Synthesis Under Differing Pose 
and Lighting 

The reconstructed surface and the illumination cones
can be combined to synthesize novel images of an object
under differing pose and lighting. However, one
complication arises because of the generalized bas-relief
(GBR) ambiguity. Even though shadows are preserved
under GBR transformations [2], without resolution of
this ambiguity, images with non-frontal view-point synthesized
from a GBR reconstruction will differ from a
valid image by an affine warp of image coordinates. (It
is affine because GBR is a 3-D affine transformation
and the weak perspective imaging model assumed here
is linear.) Since the affine warp is an image transformation,
one could perform recognition over variation in
viewing direction and affine image transformations. Alternatively,
one can attempt to resolve the GBR ambiguity
to obtain a Euclidean reconstruction using class-specific
information. In our experiments with faces,
we essentially try to fit the GBR reconstructions to
a canonical face. We take advantage of the left-to-right
symmetry of faces and the fairly constant ratios
of distances between facial features such as the eyes,
the nose, and the forehead to resolve the three parameters
of the GBR ambiguity. Once resolved, it is a
simple matter to use ray-tracing techniques to render
synthetic images under variable pose and lighting.

Figure 3: Synthesized images under variable pose and
lighting. The representation was constructed from the
images in Figure 2.a.

Figure 3 shows synthetic images of the face under 
novel pose and lighting. These images were generated 
from the images in Figure 2.a, where the pose is fixed and
there are only small, unknown variations in illumina- 
tion. In contrast, the synthetic images exhibit not only 
large variations in pose but also a wide range in shading 
and shadowing. 

3 Representations for Recognition 

It is clear that for every pose of the object, the set of 
images under all lighting conditions is a convex cone. 
Therefore, the previous section provides a natural way 
for generating synthetic representations of objects suit- 
able for recognition under variable pose and illumina- 
tion. For every sample pose of the object, generate its 
illumination cone and with the union of all the cones 
form its representation. 

However, the number of independent normals in B 
can be large (more than a thousand) hence the number 
of extreme rays needed to completely define the illu- 
mination cone can run in the millions (see Section 2). 
Therefore, we must approximate the cone in some fash- 
ion; in this work, we choose to use a small number 
of extreme rays (images). The hope is that a sub-sampled
cone will provide an approximation that negligibly
decreases recognition performance; in our experience,
around 80 images are sufficient, provided that
the corresponding light source directions $s_{ij}$ are more
or less uniform on the illumination sphere. The resulting
cone $\mathcal{C}^*$ is a subset of the object's true cone $\mathcal{C}$ for
a particular pose.

Figure 4: TOP ROW: Three images from the test set.
BOTTOM ROW: The closest reconstructed image from
the representation. Note that these images are not explicitly
stored, but lie within the closest matching linear
subspace.

Another simplifying factor that can reduce the size 
of the representation is the assumption of a weak per- 
spective imaging model. Under this model, the effect 
of pose variation can be decoupled into that due to 
image plane translation, rotation, and scaling (a simi- 
larity transformation), and that due to the viewpoint 
direction. Within a face recognition system, the face 
detection process generally provides estimates for the 
image plane transformations. Neglecting the effects of 
occlusion or appearance of surface points, the variation 
due to viewpoint can be seen as a non-linear warp of the 
image coordinates with only two degrees of freedom. 

Yet, recognition using this representation consisting
of sub-sampled illumination cones will still be costly
since computing distance to a cone is $O(ne^2)$, where $n$
is the number of pixels and $e$ is the number of extreme
rays (images). From an empirical study, it was conjectured
in [1] that the cone for typical objects is flat (i.e.,
all points lie near a low-dimensional linear subspace),
and this was confirmed for faces in [5]. Hence, an alternative
is to model a face in fixed pose but over all lighting
conditions by a low-dimensional linear subspace.
Finally, for a set of sample viewing directions, we construct
subspaces which approximate the corresponding
cones. We chose to use an 11-D linear subspace for
each pose since 11 dimensions capture over 99% of the
variation in the sample extreme rays. Recognition of a
test image $x$ is then performed by finding the closest
linear subspace to $x$. Figure 4 shows the closest match
for images of an individual in three poses. This figure
qualitatively demonstrates how well the union of 11-D
subspaces approximates the true cones.
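
A minimal sketch of the subspace approximation and of nearest-subspace recognition is given below. The function names are hypothetical, and the choice not to subtract a mean image is an assumption made here so that each model is a linear subspace through the origin.

```python
import numpy as np

def fit_subspace(rays, energy=0.99):
    """Orthonormal basis capturing `energy` of the squared singular values.

    rays: (n, e) array whose columns are sampled extreme rays for one pose.
    No mean is subtracted, so the result is a linear subspace through the origin
    (an assumption; the text does not say whether the data were centered).
    """
    U, sv, _ = np.linalg.svd(rays, full_matrices=False)
    frac = np.cumsum(sv**2) / np.sum(sv**2)
    d = int(np.searchsorted(frac, energy) + 1)   # smallest d reaching `energy`
    return U[:, :d]

def distance_to_subspace(x, U):
    """Euclidean distance from x to span(U), U having orthonormal columns."""
    return float(np.linalg.norm(x - U @ (U.T @ x)))

def classify(x, subspaces):
    """Label of the closest subspace; `subspaces` maps label -> basis."""
    return min(subspaces, key=lambda label: distance_to_subspace(x, subspaces[label]))
```

With an orthonormal basis the distance requires only two matrix-vector products, so its cost grows with the product of the number of pixels and the subspace dimension rather than with the square of the number of extreme rays.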

For the experimental results reported below, sub- 
spaces were constructed by sampling the viewing 
sphere at 4° intervals over the elevation from -24° to 
+24° and the azimuth from -4° to +28° about frontal. 
As a final speed-up, the 117 11-D linear subspaces were
projected down to a 100-dimensional subspace of the
image space whose basis vectors were computed using
SVD. In summary, each person's face was represented
by the union of 117 11-D linear subspaces within a 100-dimensional
subspace of the image space. Recognition
was then performed by computing the distance of a test
image to each 100-D subspace plus the distance to the
11-D subspaces within the 100-D space.

Figure 5: A geodesic dome with 64 strobes used to
gather images under variable illumination and pose.
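
The count of 117 subspaces follows directly from the sampling grid described above; the short check below simply enumerates the sampled viewing directions (variable names are illustrative).

```python
import numpy as np

elevation = np.arange(-24, 24 + 1, 4)   # -24, -20, ..., +24 degrees
azimuth = np.arange(-4, 28 + 1, 4)      # -4, 0, ..., +28 degrees
poses = [(int(el), int(az)) for el in elevation for az in azimuth]
print(len(elevation), len(azimuth), len(poses))  # 13 9 117
```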

4 Recognition Results 

The experimentation reported here was performed on 
the Yale Face Database B. For capturing this database, 
we have constructed a geodesic lighting rig with 64 
computer controlled xenon strobes shown in Figure 5. 
With this rig, we can modify the illumination at frame
rates and capture images under variable pose and il- 
lumination. Images of ten individuals were acquired 
under 64 different lighting conditions in nine poses 
(frontal pose, five poses at 12° and three poses at 24° 
from the camera's axis). Of the 64 images per person 
in each pose, 45 were used in our experiments, a to- 
tal of 4050 images. The images from each pose were 
divided into 4 subsets (12°, 25°, 50° and 77°) accord- 
ing to the angle of the light source with the camera's 
axis (see Figure 1). Subset 1 (respectively 2, 3, 4) con- 
tains 70 (respectively 120, 120, 140) images per pose. 
Throughout, the 19 images of Subsets 1 and 2 from the 
frontal pose of each face were used as training images 
for generating its representation. 
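
The image counts quoted in this section are mutually consistent, as the following arithmetic check shows (variable names are illustrative).

```python
individuals, poses, lights_used = 10, 9, 45
subset_sizes = {1: 70, 2: 120, 3: 120, 4: 140}   # images per pose, all individuals

total_images = individuals * poses * lights_used
per_pose = sum(subset_sizes.values())
train_per_person = (subset_sizes[1] + subset_sizes[2]) // individuals
print(total_images, per_pose, train_per_person)  # 4050 450 19
```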

4.1 Extrapolation in Illumination 

The first set of experiments was performed under fixed 
pose on the 450 images from the frontal pose (45 per 
person). This was to compare three other recogni- 
tion methods to the illumination cones representation. 
From a set of face images labeled with the person's 
identity (the learning set) and an unlabeled set of face 
images from the same group of people (the test set),
each algorithm is used to identify the person in the 
test images. For more details about the comparison 
algorithms, see [3] and [7]. We assume that each face 
has been located and aligned within the image.

Figure 6: Extrapolation in Illumination: Each of
the methods is trained on images with near frontal illumination
(Subsets 1 and 2) from Pose 1 (frontal pose).
This graph shows the error rates under more extreme
light source conditions in fixed pose.



The simplest recognition scheme is a nearest neigh- 
bor classifier in the image space [4]. An image in the 
test set is recognized (classified) by assigning to it the 
label of the closest point in the learning set, where distances
are measured in the image space. When all of
the images are normalized to have zero mean and unit 
variance, this procedure is also known as Correlation. 
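
A minimal sketch of the Correlation classifier follows, assuming images are stored as rows of a NumPy array. The function names are hypothetical, and images with constant intensity (zero variance) are not handled.

```python
import numpy as np

def normalize(images):
    """Give each image (row) zero mean and unit variance."""
    images = np.asarray(images, dtype=float)
    centered = images - images.mean(axis=1, keepdims=True)
    return centered / centered.std(axis=1, keepdims=True)

def correlation_classify(test, train, labels):
    """Nearest-neighbor labels for `test` rows given labeled `train` rows."""
    T, L = normalize(test), normalize(train)
    d2 = ((T[:, None, :] - L[None, :, :]) ** 2).sum(axis=2)   # squared distances
    return [labels[i] for i in d2.argmin(axis=1)]
```

After normalization, a test image that differs from a training image only by a positive gain and an offset in intensity lies at distance zero from it, which is why the result does not depend on light source intensity.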

A technique now commonly used in computer
vision, particularly in face recognition, is principal
components analysis (PCA), which is popularly known
as Eigenfaces [8, 12, 13, 19]. One proposed method for
handling illumination variation in PCA is to discard
the three most significant principal components; in 
practice, this yields better recognition performance [3]. 
For both the Eigenfaces and Correlation tests, the im- 
ages were normalized to have zero mean and unit vari- 
ance, as this improved the performance of these meth- 
ods. This also made their results independent of light 
source intensity. For the Eigenfaces method, we used 
20 principal components; recall that performance ap- 
proaches correlation as the dimension of the feature 
space is increased [3, 13]. Error rates are also presented 
when the principal components four through twenty- 
three were used. 
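
The two Eigenfaces variants differ only in which principal components are kept. The sketch below, with hypothetical function names, computes the components by SVD of the mean-centered training images and selects either the first twenty or components four through twenty-three.

```python
import numpy as np

def eigenfaces(train, first=0, count=20):
    """Mean image and `count` principal components starting at index `first`.

    train: (num_images, n) array. first=3 skips the three most significant
    components, i.e. uses components four through twenty-three when count=20.
    """
    mean = train.mean(axis=0)
    _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, Vt[first:first + count]

def project(images, mean, components):
    """Feature vectors: coordinates of the images in the chosen components."""
    return (images - mean) @ components.T
```

Recognition would then proceed by nearest neighbor in the projected feature space; that step is omitted here.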

A third approach is to model the illumination variation
of each face with the three-dimensional linear subspace
$\mathcal{L}$ described in Section 2.1. To perform recognition,
we simply compute the distance of the test image
to each linear subspace and choose the face corresponding
to the shortest distance. We call this recognition
scheme the Linear Subspace method [2]; it is a variant
of the photometric alignment method proposed in [16]
and is related to [9, 14]. While this models the variation
in image intensities when the surface is completely
illuminated, it does not model shadowing.

Finally, recognition is performed using the illumina- 
tion cone representation. In fact, we tested on three 
variations. In the first (Cones-attached), the represen- 
tation was constructed without cast shadows, so the ex- 
treme rays are generated directly from Equation 3. In 
the second variation (Cones-cast), the representation 
was constructed as described in Section 2.2 where we 
employed ray-tracing that uses the reconstructed surface
of a face $z(x, y)$ to determine cast shadows. In both
variations, recognition was performed by computing 
the distance of the test image to each cone and choosing 
the face corresponding to the shortest distance. Since 
cones are convex, the distance can be found by solving 
a convex optimization problem (see [7]). 
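
The distance from a test image to a cone spanned by a finite set of extreme rays is a non-negative least-squares problem. The sketch below solves it by projected gradient descent, which is a generic stand-in chosen for brevity and not necessarily the solver used in [7]; the function name and the iteration count are hypothetical.

```python
import numpy as np

def distance_to_cone(x, R, iters=5000):
    """Distance from x to the cone {R a : a >= 0} by projected gradient descent.

    R: (n, e) array whose columns are extreme rays. This generic solver is a
    stand-in for illustration, not necessarily the method used in [7].
    """
    a = np.zeros(R.shape[1])
    step = 1.0 / np.linalg.norm(R, 2) ** 2    # 1 / Lipschitz constant of the gradient
    for _ in range(iters):
        grad = R.T @ (R @ a - x)              # gradient of 0.5 * ||R a - x||^2
        a = np.maximum(a - step * grad, 0.0)  # project onto a >= 0
    return float(np.linalg.norm(R @ a - x)), a
```

A fixed iteration count keeps the example short; a practical implementation would stop on a convergence test or use a dedicated non-negative least-squares routine.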

In the last variation, the illumination cone of each
face with cast shadows $\mathcal{C}^*$ is approximated by an 11-dimensional
linear subspace (Cones-cast subspace approximation).
As mentioned before, it was empirically
determined that 11 dimensions capture over 99% of
the variance in the sample extreme rays. The basis
vectors for this space are determined by performing
SVD on the extreme rays in $\mathcal{C}^*$ and then picking the 11
singular vectors associated with the largest singular values.
Recognition was performed by computing the distance
of the test image to each linear subspace and choosing
the face corresponding to the shortest distance. Using
the cone subspace approximation reduces both the
storage and the computational time. Since the basis
vectors of each subspace are orthogonal, the computational
complexity is only $O(nm)$, where $n$ is the number
of pixels and $m$ is the number of basis vectors.

Similar to the extrapolation experiment described 
in [3], each method was trained on samples from Sub- 
sets 1 and 2 (19 samples per person) and then tested 
on samples from Subsets 3 and 4. Figure 6 shows the 
results from this experiment. (This test was also per- 
formed on the Harvard Robotics Lab face database and 
was reported in [7].) Note that the cone subspace ap- 
proximation performed as well as the raw illumination 
cones without any mistakes on 450 images. This sup- 
ports the use of low dimensional subspaces in the full 
representation of Section 3 that models image varia- 
tions due to viewing direction and lighting. 

4.2 Recognition Under Variable Pose and 
Illumination 

Next, we performed recognition experiments on images 
in which the pose varies as well as illumination. Im- 
ages from all nine poses in the database were used in 
these tests. Four recognition methods were compared 
on 4050 images. Each method was trained on images
with near frontal illumination (Subsets 1 and 2) from
the frontal pose, and tested on all images from all nine
poses, an extrapolation in both pose and illumination.

Figure 7: Extrapolation in Pose: Error rates as
the viewing direction becomes more extreme. Again,
the methods were trained on images with near frontal
illumination (Subsets 1 and 2) from Pose 1 (frontal
pose). Note that each reported error rate is for all
illumination subsets (1 through 4).

The first method was Correlation as described in the 
previous section. The next one (Cones approximation) 
modeled a face with an 11-D subspace approximation of 
the cone (with cast shadows) in the frontal pose. No effort
was made to accommodate pose during recognition,
not even a search in image plane transformations. The
next method (Cones approximation with planar trans- 
formations) also modeled a face with an 11-D subspace 
approximation of the cone in the frontal pose, but un- 
like the previous method, recognition was performed 
over variations of planar transformations. Finally, a 
face was modeled with the representation described in 
Section 3. Each of the 10 individuals was represented 
by a 100-D subspace which contained 117 11-D lin- 
ear subspaces each modeling the variation in illumina- 
tion for each sampled view-point. As with the previous 
method, recognition was performed over a variation of 
planar transformations. The results of these experiments
are shown in Figure 7. Note that each reported
error rate is for all illumination subsets (1 through 4).
Figure 8, on the other hand, shows the break-down of
the results of the last method for different poses against
variable illumination. As demonstrated in Figure 7,
the method of cone subspace approximation with planar
transformations performs reasonably well for poses
up to 12° from the viewing axis but fails when the
viewpoint becomes more extreme.

Figure 8: Error rates for different poses against variable
lighting using the representation of Section 3.

We note that in the last two methods the search in 
planar transformations did not include image rotations 
(only translations and scale) to reduce computational 
time. We believe that the results would improve if image
rotations were included or even if the view-point
space and illumination cones were more densely sam- 
pled and the 11-D subspaces were not projected down 
to a 100-D subspace. 

5 Discussion 

In constructing the representation of an object from 
a small set of training images, we have assumed that 
the object's surface exhibited a Lambertian reflectance
function. Although our results support this assump- 
tion, more complex reflectance functions may yield bet- 
ter recognition results. Other exciting domains for 
these representations include facial expression recog- 
nition and object recognition with occlusions. 
FACE RECOGNITION USING STATISTICAL MODELS

G J Edwards, A Lanitis, C J Taylor and T F Cootes* 

We describe the use of flexible models for the representation of shape and grey-level appearance of human 
faces. The models are controlled by a small number of parameters, which can be used to code the overall 
appearance of a face for image compression and classification. Shape and grey-level appearance are included 
in a single model. Discriminant analysis allows the isolation of variation important for classification of 
identity. We have performed both face recognition and face synthesis experiments and present the results in 
this paper. 

Introduction 

A successful face recognition system should be able to locate a face, and classify its identity, regardless of 
factors such as pose, lighting and expression variation. Human faces are highly variable objects, both in terms 
of the different appearance of individuals, and the variation present in any individual face. In this context, the 
analysis of human faces presents a difficult machine vision task. As a result of this difficulty, some 
researchers have concentrated on particularly constrained applications; contributing little to overall progress. 
Others have attempted to tackle the various generic problems independently; the drawback of this approach is 
that the effects of all the sources of variability are compounded, so it is extremely difficult to extract a 
description for one characteristic of interest (e.g. individual appearance) which is not sensitive to others, (e.g. 
facial expression, lighting and pose). 1 Many current techniques can be found in the review by Chellapa et al. 2 

Rather than trying to separate face analysis into various goals, such as feature location, person identification, 
expression recognition, lighting correction, etc., we have developed a unified approach. The basis for this is a 
compact, parameterized model of facial appearance, which accounts for all the important, systematic sources 
of variability. Our approach consists of both modeling, in which flexible appearance models of facial 
appearance are generated, and interpretation, in which the models are used to analyze information content of 
the face image, such as the identity of the individual. 

Modeling Shape Appearance 

In order to understand the appearance of faces, we model both the shape and grey-level appearance of a 
training set of face images. All the models used in our system are of the same mathematical form. Each of the 
training examples is represented by N variables:

$$X_i = (x_{1i}, x_{2i}, \ldots, x_{Ni}) \qquad (1)$$

where $x_{ki}$ is the kth variable in the ith example.

When modeling the shape of faces, these variables represent the positions of key landmark points on the 
images in the training set. From the training examples we build a Point Distribution Model (PDM) 3 . Each 
training image is marked with a set of 144 labeled points, corresponding to specific facial features. An 
example of a face overlaid with landmark points is shown in Figure 1. Given a set of these training vectors, 
the average example, $X_{mean}$, is calculated and the deviation of each example from the mean is established. A
principal component analysis of the covariance matrix of the deviations reveals the main linearly independent
modes of variation of face shape, which together represent nearly all the variation in the training set (typically
30 modes for 99.5% of the variation in a training set containing 400 examples). Any training example X can be
approximated by:

$$X = X_{mean} + P b \qquad (2)$$

* The authors are in The Department of Medical Biophysics, University of Manchester, Oxford Rd., Manchester. Tel No. 
0161 275 5130. Email {gje,lan,bim,cjt}@svl. smb.man.ac.uk 
where P is a matrix of unit eigenvectors of the covariance of deviations and b is a vector of eigenvector
weights (these are referred to as model parameters). By modifying b, new instances of the model can be
generated; if the elements of b are kept within a few standard deviations of the mean over the training set,
then the corresponding model instances are plausible examples of the modeled objects. By varying each model 
parameter over this limited range, we can illustrate the modes of variation of the model, as shown in Figure 2. 

Since the columns of P are orthogonal unit vectors, $P^T P = I$, and Equation 2 can be solved with respect to b:

$$b = P^T (X - X_{mean}) \qquad (3)$$

Equation 3 can be used to transform an example X into model parameters. 
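
As a rough illustration of Equations 2 and 3, the following Python/NumPy sketch builds a linear model from a matrix of training vectors and converts between examples and model parameters. The function names, the use of an SVD of the deviations in place of an explicit covariance eigendecomposition, and the `variance_kept` threshold are assumptions made for this sketch, not details of the system described here.

```python
import numpy as np

def fit_linear_model(X, variance_kept=0.995):
    """X: (num_examples, N) array, one training vector per row (hypothetical layout)."""
    x_mean = X.mean(axis=0)
    D = X - x_mean                                  # deviations from the mean
    # Rows of Vt are unit eigenvectors of the covariance of the deviations.
    _, sing, Vt = np.linalg.svd(D, full_matrices=False)
    var = sing**2 / (X.shape[0] - 1)                # variance explained by each mode
    cum = np.cumsum(var) / var.sum()
    t = min(int(np.searchsorted(cum, variance_kept)) + 1, len(var))
    P = Vt[:t].T                                    # (N, t), orthonormal columns
    return x_mean, P, var[:t]

def to_params(x, x_mean, P):
    """Equation 3: b = P^T (X - X_mean)."""
    return P.T @ (x - x_mean)

def from_params(b, x_mean, P):
    """Equation 2: X = X_mean + P b."""
    return x_mean + P @ b
```

Keeping each element of `b` within a few multiples of `sqrt(var)` corresponds to the plausibility limits on the model parameters described above.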




Figure 1. Example Overlaid with Landmark Points

Figure 2. First 4 modes of shape variation shown varying horizontally



Modeling Full Appearance 

The same statistical method can be used to model the grey-level appearance of faces. We wish to model grey- 
level appearance independently of shape. To do this, we first apply a warping algorithm 4 developed by 
Bookstein based on thin-plate splines, which warps each example to the mean shape, in such a way that grey- 
level changes around each landmark are kept to a minimum. Each training example is then represented by the 
pixel intensity values in the mean shape patch. After applying Principal Component Analysis, as for the shape 
model, it is possible to represent 95% of the variation in the 400 example training set by 70 parameters. 

In order to complete the model of facial appearance we combine the shape and grey-level models to produce a 
Combined Appearance Model. Given the shape and grey-level models we obtain the model parameters for
each training example, using Equation 3, and concatenate the two parameter vectors. A principal component 
analysis of these concatenated vectors over all the training examples leads to a single model describing both 
shape and grey-level variation. This combined model captures 95% of the variance in the 400 example 
training set using 55 parameters. The model fully accounts for correlation between shape and grey-level 
appearance. 

Figure 3 shows the major modes of orthogonal variation of this combined appearance model. It can be seen 
that in each of the modes of variation, several sources of variation are compounded, showing variation due to 
pose, expression, lighting and ID. 

Figure 3. First 2 Modes of Full Variation Shown Horizontally +/- 3 SD's.



Locating and Tracking Faces 

The flexible shape model can be used in an Active Shape Model (ASM) search 5 to locate and track faces in
static images and image sequences. During the training phase a model is built of the expected grey-level 
variation around each landmark point. In order to locate and track a face, an instance of the face model is 
placed in the initial image and is allowed to interact until it fits the shape of the face. Each model point 
attempts to move towards the best local match, but the shape of the whole set of points can only be changed 
by varying the shape parameters, thus ensuring that resulting shapes are similar to those encountered in the 
training set. In order to allow a greater search range, the algorithm can be performed at lower resolution levels 
until a certain degree of fit is achieved, before switching to higher resolutions. Figure 4 shows some examples 
of face location. 




Figure 4. Examples of Successful Face Location 
Identifying Faces



When a test image is presented to our system, the flexible shape model is used to locate the face and features 
automatically, using a Multi-Resolution Active Shape Model search. In the results presented in this paper the 
user is asked to indicate the approximate position of the nose, and the shape model is overlaid. Euclidean 
transformations and deformations are applied until the model is fitted to the face presented. Lanitis et al. have 
shown that it is possible to locate the face without any user initialization using global optimization techniques. 
Once the model has located the face, we find the appearance model parameters and local grey-level model 
parameters using Equation 3; these are used for classification.

Since each of the variables shows variation due to different sources of variation, it is important to emphasize 
variation which is important for recognition, by using discriminant analysis 7 . This can be done by calculating 
the Canonical Discriminant Functions 8 , or in its simplest form, by assigning a class label based on the
Mahalanobis distance. The Mahalanobis distance measure automatically assigns higher weights to those
variables which showed a greater deal of inter-class ( inter-person ) variation during the training phase.
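
The simplest form of this classification step can be sketched as follows in Python/NumPy. This is only an illustration under stated assumptions: the pooled within-class covariance, the function names and the array layout are hypothetical choices, since the text does not say how the covariance used in the distance was estimated.

```python
import numpy as np

def fit_classes(params, labels):
    """params: (num_examples, t) model parameters; labels: (num_examples,) integer ids."""
    classes = np.unique(labels)
    means = np.stack([params[labels == c].mean(axis=0) for c in classes])
    # Assumption: one pooled within-class covariance shared by all individuals.
    centred = params - means[np.searchsorted(classes, labels)]
    cov = centred.T @ centred / (len(params) - len(classes))
    return classes, means, np.linalg.inv(cov)

def classify(b, classes, means, cov_inv):
    """Return the label with the smallest squared Mahalanobis distance, and all distances."""
    d = means - b
    dist2 = np.einsum("ij,jk,ik->i", d, cov_inv, d)
    return classes[np.argmin(dist2)], dist2
```

Sorting `dist2` rather than taking only its minimum gives a ranked list of candidates, which is what a "correct within best 3" score requires.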

Lanitis et al. 9 have performed person recognition trials, the main results of which are outlined below: 

The training set consisted of 30 individuals, 10 images of each. 

The main test set consisted of 10 unseen images of each of the 30 individuals. 

A secondary test set consisted of 3 'difficult' images of each of the 30 individuals.

For all individuals, the test and training images showed varying pose, lighting and expression conditions. The 
'difficult' test images showed partial occlusion of the face. Some examples of the images used are shown in
Figure 5. 

Normal Test Set (200 Images)      Correct                  95.5%
                                  Correct within best 3    99.0%
Difficult Test Set (60 Images)    Correct                  43.3%
                                  Correct within best 3    71.6%

These results include errors where either the face was not correctly located, or where it was correctly located
but incorrectly identified. 



Figure 5. Training and Test Examples (panels: Training Images, Test Images, Difficult Images)
Reconstruction and Coding



Given a set of combined appearance model parameters it is possible to reconstruct a face image. Figure 6 
shows a face image, together with its parameterized reconstruction. The reconstruction is made from 55
parameters; the original face image is at 320x256 resolution. This represents a very high degree of
compression. 

We have (Edwards et al. 8 ) addressed the problem of extracting specific types of variation from the combined 
models, in order to make the models more specific for particular applications. Using Canonical Discriminant 
Analysis it is possible to define a set of orthogonal modes of variation which correspond only to one type of 
variation, for example, change of identity. Also, having found these modes, it is possible to remove them from 
the model. The resulting model allows us to manipulate face images without changing their identity. In Figure 
7, we show an example of a face, and some manipulations of that face, by varying parameters which have 
been selected so as not to change the identity of the face. Thus, given a single example of a face, we can 
synthesize its appearance under different conditions of pose, lighting, and expression. When only those modes 
of variation which do not change identity are considered, the model achieves further compression, reducing to 
just 21 parameters. 






Figure 6. Face with its Parameterized Reconstruction

Figure 7. Face Image Manipulated by Varying Parameters which don't Affect Identity



Conclusions 



We have presented a system, which can be used for locating and tracking faces, coding, reconstruction, and 
identification. Our recognition results are very encouraging, especially considering the allowed variation in 
pose, lighting and expression. The statistical approach allows unimportant variation to be dealt with 
automatically. Our system copes with all aspects of face image processing within a single framework. The 
ability to locate and identify faces has potential for powerful security applications, particularly in access 
control and person monitoring. A major benefit of the system is that it is likely to be entirely passive, 
requiring no interaction (or even knowledge of presence) unlike, say, a keycard, or fingerprint reader. 

The small number of parameters required to code a face image allows very compact storage. Access cards, 
Bank cards, and such like, could store encrypted appearance parameters of their owner in the magnetic strip, 
the reader of which would display an image of the true owner to the Bank teller or Shop assistant, virtually 
eliminating the use of fake or stolen credit cards. The high compression would also allow very fast 
comparison of faces with image databases. It is possible to envisage police and security services equipped 
with small CCD cameras able to instantly compare a live person with a database of faces. Statistical models 
are equally useful for reconstruction and manipulation as shown in Figures 6 and 7. The ability to manipulate 
images in a photo-realistic way has potential usefulness in forensic techniques such as photo-fit. 
Abstract

We present an illumination-based method for synthe- 
sizing images of an object under novel viewing condi- 
tions. Our method requires as few as three images of 
the object taken under variable illumination, but from 
a fixed viewpoint. Unlike multi-view based image syn- 
thesis, our method does not require the determination 
of point or line correspondences. Furthermore, our
method is able to synthesize not simply novel view- 
points, but novel illumination conditions as well. We 
demonstrate the effectiveness of our approach by gen- 
erating synthetic images of human faces. 

1 Introduction 

We present an illumination-based method for creating
novel images of an object under differing pose and
lighting. This method uses as few as three images 
of the object taken under variable lighting but fixed 
pose to estimate the object's albedo and generate its 
geometric structure. Our approach does not require 
any knowledge about the light source directions in the 
modeling images, or the establishment of point or line 
correspondences. 

In contrast, nearly all approaches to view synthesis 
or image-based rendering take a set of images gath- 
ered from multiple viewpoints and apply techniques 
akin to structure from motion [17, 28, 6], stereopsis 
[21, 9], image transfer [3], image warping [18, 20, 24],
or image morphing [7, 23]. Each of these methods
requires the establishment of correspondence between 
image data (e.g. pixels) across the set. (Unlike other 
methods, the Lumigraph [12, 19] exhaustively sam- 
ples the ray space and renders images of an object from 
novel viewpoints by taking 2-D slices of the 4-D light
field at the appropriate directions.) Since dense corre- 
spondence is difficult to obtain, most methods extract 
sparse image features (e.g. corners, lines), and may 
use multi-view geometric constraints (e.g. the trifocal 
tensor [2, 1]) or scene-dependent geometric constraints 
[9, 8] to reduce the search process and constrain the estimates.
By using a sequence of images taken at nearby
viewpoints, incremental tracking can further simplify 
the process, particularly when features are sparse. 

For these approaches to be effective, there must be 
sufficient texture or viewpoint-independent scene fea- 
tures, such as albedo discontinuities or surface nor- 
mal discontinuities. From sparse correspondence, the 
epipolar geometry can be established and stereo tech- 
niques can be used to provide dense reconstruction. 
Underlying nearly all such stereo algorithms is a constant
brightness assumption - that is, the intensity (irradiance)
of corresponding pixels should be the same.
In turn, constant brightness implies two seldom stated 
assumptions: (1) The scene is Lambertian, and (2) the 
lighting is static with respect to the scene - only the
viewpoint is changing. 

In the presented illumination-based approach, we 
also assume that the surface is Lambertian, although
this assumption is very explicit. As a dual to the second
point listed above, our method requires that the
camera remains static with respect to the scene - only 
the lighting is changing. As a consequence, geomet- 
ric correspondence is trivially established, and so the 
method can be applied to scenes where it is difficult 
to establish multi-viewpoint correspondence, namely 
scenes that are highly textured (i.e. where image fea- 
tures are not sparse) or scenes that completely lack 
texture (i.e. where there are insufficient image features).

At the core of our approach for generating novel 
viewpoints is a variant of photometric stereo [27, 29,
14, 13, 30] which simultaneously estimates geometry
and albedo across the scene. However, the main limi- 
tation of classical photometric stereo is that the light 
source positions must be accurately known, and this 
necessitates a fixed lighting rig as might be possible in 
an industrial setting. Instead, the proposed method 
does not require knowledge of light source locations, 
and so illumination could be varied by simply waving
a light around the scene. 

In fact, our method derives from work by Belhumeur
and Kriegman in [5] where they showed that a small
set of images with unknown light source directions can
be used to generate a representation - the illumination
cone - which models the complete set of images of an
object (in fixed pose) under all possible illumination. 
This method had as its precursor the work of Shashua
[25] who showed that, in the absence of shadows, the
set of images of an object lies in a 3-D subspace in
the image space. Generated images from the illumi- 
nation cone representation accurately depict shading 
and attached shadows under extreme lighting; in [11] 
the cone representation was extended to include cast 
shadows (shadows the object casts on itself) for objects
with non-convex shapes. Unlike attached shadows,
cast shadows are global effects, and their prediction
requires the reconstruction of the object's surface.

In generating the geometric structure, multi- 
viewpoint methods typically estimate depth directly 
from corresponding image points [21, 9]. It is well 
known that without sub-pixel correspondence, stereop- 
sis provides a modest number of disparities over the ef- 
fective operating range, and so smoothness or regular- 
ization constraints are used to interpolate and provide 
smooth surfaces. The presented illumination-based 
method estimates surface normals which are then in- 
tegrated to generate a surface. As a result, very subtle 
changes in depth are recovered as demonstrated in the 
synthetic images in Figures 4 and 5. Those images 
show also the effectiveness of our approach in gener- 
ating realistic images of faces under novel pose and 
illumination conditions. 

2 Illumination Modeling 

In [5], Belhumeur and Kriegman have shown that, for 
a convex object with a Lambertian reflectance func- 
tion, the set of all images under an arbitrary combina- 
tion of point light sources forms a convex polyhedral 
cone in the image space $\mathbb{R}^n$ which can be constructed
with as few as three images. 

Let $x \in \mathbb{R}^n$ denote an image with n pixels of a
convex object with a Lambertian reflectance function
illuminated by a single point source at infinity. Let
$B \in \mathbb{R}^{n \times 3}$ be a matrix where each row in B is the
product of the albedo with the inward pointing unit
normal for a point on the surface projecting to a particular
pixel in the image. A point light source at infinity
can be represented by $s \in \mathbb{R}^3$ signifying the product
of the light source intensity with a unit vector in the
direction of the light source. A convex Lambertian surface
with normals and albedo given by B, illuminated
by s, produces an image x given by

$$x = \max(Bs, 0), \qquad (1)$$

where max(Bs, 0) sets to zero all negative components
of the vector Bs. The pixels set to zero correspond to
the surface points lying in an attached shadow. Convexity
of the object's shape is assumed at this point
to avoid cast shadows. It should be noted that when
no part of the surface is shadowed, x lies in the 3-D
subspace $\mathcal{L}$ given by the span of the columns of B.



If an object is illuminated by k light sources at infinity,
then the image is given by the superposition of
the images which would have been produced by the
individual light sources, i.e.,

$$x = \sum_{i=1}^{k} \max(B s_i, 0) \qquad (2)$$

where $s_i$ is a single light source. Due to the inherent
superposition, it follows that the set of all possible im- 
ages C of a convex Lambertian surface created by vary- 
ing the direction and strength of an arbitrary number 
of point light sources at infinity is a convex cone. It is 
also evident from Equation 2 that this convex cone is 
completely described by matrix B. 
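
Equations 1 and 2 translate almost directly into code. The Python/NumPy sketch below is only an illustration; the function name `render` and the array layout (one row of B per pixel, one row of `sources` per light) are assumptions made here.

```python
import numpy as np

def render(B, sources):
    """Image of a convex Lambertian surface under point sources at infinity.

    B: (n, 3) albedo-scaled inward normals, one row per pixel.
    sources: (3,) or (k, 3) light vectors, each intensity times unit direction.
    """
    sources = np.atleast_2d(sources)
    # Equation 1 for every source (negative values are attached shadows),
    # followed by the superposition of Equation 2.
    return np.maximum(B @ sources.T, 0.0).sum(axis=1)
```

Because scaling a source scales its image and adding sources adds their images, any non-negative combination of rendered images is itself a valid image, which is the cone property stated above.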

This suggests a way to construct the illumination 
model for an individual: gather three or more im- 
ages of the face without shadowing illuminated by a 
single light source at unknown locations but viewed 
under fixed pose, and use them to estimate the three-dimensional
illumination subspace $\mathcal{L}$. This can be done
by first normalizing the images to unit length and then
estimating the best three-dimensional orthogonal basis
B* using a least-squares minimization technique such
as singular value decomposition (SVD). Note that the
basis B* differs from B by an unknown linear transformation,
i.e., $B = B^* A$ where $A \in GL(3)$ [10, 13, 22];
for any light source s, $x = Bs = (B^* A)(A^{-1} s)$. Nevertheless,
both B* and B define the same illumination
cone and represent valid illumination models.
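
For shadow-free data, the SVD step can be sketched in a few lines of Python/NumPy. The function name and the column-per-image layout are assumptions made for this illustration.

```python
import numpy as np

def estimate_basis(X):
    """X: (n, c) matrix whose columns are shadow-free images, with c >= 3."""
    X = X / np.linalg.norm(X, axis=0, keepdims=True)   # normalize images to unit length
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :3]                                    # orthonormal basis B* (n x 3)
```

For noise-free, shadow-free images generated from a true B, the 3 x 3 matrix `A = B_star.T @ B` satisfies `B_star @ A == B` up to rounding, which is the GL(3) ambiguity noted above.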

Unfortunately, using SVD in the above procedure 
leads to an inaccurate estimate of B*. For even a 
convex object whose Gaussian image covers the Gauss 
sphere, there is only one light source direction (the 
viewing direction) for which no point on the surface is 
in shadow. For any other light source direction, shad- 
ows will be present. If the object is non-convex, such 
as a face, then shadowing in the modeling images is 
likely to be more pronounced. When SVD is used to 
find B* from images with shadows, these systematic 
errors bias its estimate significantly. Therefore, an al- 
ternative way is needed to find B* that takes into ac- 
count the fact that some data values should not be 
used in the estimation. 

We have implemented a variation of [26] (see also
[28, 16]) that finds a basis B* for the 3-D linear subspace
$\mathcal{L}$ from image data with missing elements. To
begin, define the data matrix for c images of an individual
to be $X = [x_1 \ldots x_c]$. If there were no shadowing,
X would be rank 3 (assuming no image noise),
and we could use SVD to factorize X into $X = B^* S^*$
where S* is a 3 x c matrix the columns of which are the
light source directions scaled by the light intensities
for all c images.

Since the images have shadows (both cast and attached),
and possibly saturations, we first have to determine
which data values are invalid. Unlike saturations,
which can be trivially determined, finding shadows
is more involved. In our implementation, a pixel is
assigned to be in shadow if its value divided by its corresponding
albedo is below a threshold. As an initial
estimate of the albedo, we use the average of the modeling
(or training) images. A conservative threshold
is then chosen to determine shadows, making it almost
certain no invalid data is included in the estimation
process, at the small expense of throwing away some
valid data. After finding the invalid data, the following
estimation method is used: without doing any row or
column permutations, sift out all the full rows (with no
invalid data) of matrix X to form a full sub-matrix $\tilde{X}$.
Note that the number of pixels in an image (i.e. the
number of rows of X) is much larger than the number
of images (i.e. the number of columns of X), which
means we can always find a large number of full rows
so that the number of rows of $\tilde{X}$ is larger than its
number of columns. Therefore, perform SVD on $\tilde{X}$
to get a fairly good initial estimate of S*. Fix S*
and estimate each of the rows of B* independently using
least squares. Then, fix B* and update each of
the light source directions $s_i$ independently, again using
least squares. Repeat these last two steps until
estimates converge. In our experiments, the algorithm
is very well behaved, converging to the global minimum
within 10-15 iterations. Though it is possible to
converge to a local minimum, we never observed this
either in simulation or in practice.
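
The estimation loop just described can be sketched as follows in Python/NumPy. It is a minimal illustration, not the implementation used in the experiments: the function name, the boolean `valid` mask, and the fixed iteration count in place of a convergence test are all assumptions, and every pixel and every image is assumed to keep at least three valid measurements.

```python
import numpy as np

def factor_with_missing(X, valid, iters=15):
    """Rank-3 factorization X ~ B S that ignores invalid entries.

    X: (n, c) images as columns; valid: (n, c) bool, False where shadowed or saturated.
    """
    n, c = X.shape
    full = valid.all(axis=1)                       # rows with no invalid data
    _, sing, Vt = np.linalg.svd(X[full], full_matrices=False)
    S = sing[:3, None] * Vt[:3]                    # initial 3 x c estimate of S*
    B = np.zeros((n, 3))
    for _ in range(iters):
        for i in range(n):                         # fix S*, solve each row of B*
            m = valid[i]
            B[i] = np.linalg.lstsq(S[:, m].T, X[i, m], rcond=None)[0]
        for j in range(c):                         # fix B*, solve each light source
            m = valid[:, j]
            S[:, j] = np.linalg.lstsq(B[m], X[m, j], rcond=None)[0]
    return B, S
```

In practice one would probably monitor the residual over the valid entries and stop once it no longer decreases, rather than running a fixed number of iterations.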

Figure 1 demonstrates the process for constructing
the illumination model. Figure 1.a shows six of the
original single light source images of a face used in the
estimation of B*. Note that the light source in each
image moves only by a small amount (±15° in either
direction) about the viewing axis. Despite this, the
images do exhibit shadowing, e.g. left and right of
the nose. In fact, there is a tradeoff in the image acquisition
process: the smaller the motion of the light
source, meaning fewer shadows present in the images,
the worse the conditioning of the estimation problem.
If, on the other hand, the light source moves excessively,
despite the improvement in the conditioning,
more extensive shadowing can increase the possibility
of having too few (fewer than three) valid measurements
with a fixed number of images for some parts of the
face. Therefore, the light source should move in moderation
as in the images shown in Figure 1.a.

Figure 1.b shows the basis images of the estimated
matrix B*. These basis images encode not only the
albedo (reflectance) of the face but also its surface normal
field. They can be used to construct images of
the face under arbitrary and quite extreme illumination
conditions. However, the image formation model
in Equation 1 does not account for cast shadows of
non-convex objects such as faces. In order to determine
which parts of the image are in cast shadows,
given a light source direction, we need to reconstruct
the surface of the face (see next section) and then use
ray-tracing techniques.




Figure 1: a) Six of the original single light source images
used to estimate B*. Note that the light source
in each image moves only by a small amount (±15° in
either direction) about the viewing axis. Despite this,
the images do exhibit shadowing. b) The basis images
of B*.

3 Surface Reconstruction 

In this section, we demonstrate how we can generate 
an object's surface from B* after enforcing the inte- 
grability constraint on the surface normal field. It has 
been shown [4, 31] that from multiple images, in which 
the light source directions are unknown, one can only 
recover a Lambertian surface up to a three-parameter 
family given by the generalized bas-relief (GBR) trans- 
formation. This family scales the relief (flattens or ex- 
trudes) and introduces an additive plane. It has also 
been shown that the family of GBR transformations is 
the only one that preserves integrability. 

3.1 Enforcing Integrability

The vector field B* estimated in Section 2 may not
be integrable, i.e., it may not correspond to a smooth
surface. So, prior to reconstructing the surface up to
GBR, the integrability constraint must be enforced on
B*. Since no method has been developed to enforce the
integrability during the estimation of B*, we enforce
it afterwards. That is, given B*, estimate a matrix
$A \in GL(3)$ such that $B = B^* A$ corresponds to an
integrable normal field; the development follows [31].

Consider a continuous surface defined as the graph
of z(x,y), and let b(x,y) be the corresponding normal
field scaled by an albedo field. The integrability
constraint for a surface is $z_{xy} = z_{yx}$ where subscripts
denote partial derivatives. In turn, b(x,y), with components
$b_1$, $b_2$, $b_3$, must satisfy:

$$\left(\frac{b_2}{b_3}\right)_x = \left(\frac{b_1}{b_3}\right)_y$$

To estimate A such that $b^T(x,y) = b^{*T}(x,y) A$, we
expand this out. Letting the columns of A be denoted
by $A_1, A_2, A_3$ yields

$$(b^{*T} A_3)(b_x^{*T} A_2) - (b^{*T} A_2)(b_x^{*T} A_3) = (b^{*T} A_3)(b_y^{*T} A_1) - (b^{*T} A_1)(b_y^{*T} A_3)$$

which can be expressed as

$$b^{*T} S_1 b_x^{*} = b^{*T} S_2 b_y^{*} \qquad (3)$$

where $S_1 = A_3 A_2^T - A_2 A_3^T$ and $S_2 = A_3 A_1^T - A_1 A_3^T$.

$S_1$ and $S_2$ are skew-symmetric matrices and each have
three degrees of freedom. Equation 3 is linear in the
six elements of $S_1$ and $S_2$. From the estimate of B*,
discrete approximations of the partial derivatives ($b_x^*$
and $b_y^*$) are computed, and then SVD is used to solve
for the six elements of $S_1$ and $S_2$. In [31], it was shown
that the elements of $S_1$ and $S_2$ are cofactors of A, and a
simple method for computing A from the cofactors was
presented. This procedure only determines six degrees
of freedom of A. The other three correspond to the
GBR transformation [4] and can be chosen arbitrarily
because a GBR transformation preserves integrability.
The surface corresponding to $B = B^* A$ differs from the
true surface by GBR, i.e., $\bar{z}(x,y) = \lambda z(x,y) + \mu x + \nu y$
for arbitrary $\lambda, \mu, \nu$ with $\lambda \neq 0$.

3.2 Generating a GBR surface 

After enforcing integrability, we can now reconstruct 
the corresponding surface z(x,y). Note that z(x,y) is
not a Euclidean reconstruction of the face, but a rep- 
resentative element of the orbit under a GBR transfor- 
mation. Despite this, both the shading and the shad- 
owing will be correct for images synthesized from such 
a surface [4]. 
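
The claim that shading is unchanged can be checked numerically. In the Python/NumPy sketch below, the matrix G is bookkeeping derived here from the surface formula $\bar{z} = \lambda z + \mu x + \nu y$, and the function names and the unnormalised normal convention $(z_x, z_y, -1)$ are assumptions made for illustration. Transforming the normals by the inverse of G and the light by G leaves every product of a normal with the light unchanged.

```python
import numpy as np

def gbr_matrix(lam, mu, nu):
    """3 x 3 matrix G associated with z -> lam*z + mu*x + nu*y (lam != 0)."""
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [mu,  nu,  lam]])

def normals_from_gradient(zx, zy):
    """Unnormalised normals (z_x, z_y, -1), one row per surface point."""
    return np.stack([zx, zy, -np.ones_like(zx)], axis=1)

def apply_gbr(B, s, lam, mu, nu):
    """Transform normals (rows of B) and a light s so that B @ s is preserved."""
    G = gbr_matrix(lam, mu, nu)
    return B @ np.linalg.inv(G), G @ s
```

The rows of the transformed matrix are, up to the factor 1/lam, the normals of the transformed surface, so the two descriptions agree.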

To find z(x,y), we use the variational approach presented
in [15]. A surface z(x,y) is fit to the given
components of the gradient p and q by minimizing the
functional

$$\iint_{\Omega} (z_x - p)^2 + (z_y - q)^2 \, dx \, dy,$$

the Euler equation of which reduces to $\nabla^2 z = p_x + q_y$.
By enforcing the right natural boundary conditions
and employing an iterative scheme that uses a discrete
approximation of the Laplacian, we can reconstruct
the surface z(x,y) [15].
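
For small grids, a discrete analogue of this functional can be minimized directly. The Python/NumPy sketch below is a swapped-in technique, not the iterative Laplacian scheme of [15]: it stacks one finite-difference equation per neighbouring pixel pair and solves the resulting dense least-squares problem. The function name and the convention that x runs along columns and y along rows are assumptions.

```python
import numpy as np

def integrate_gradient(p, q):
    """Least-squares surface from gradients p = dz/dx (columns) and q = dz/dy (rows)."""
    h, w = p.shape
    idx = np.arange(h * w).reshape(h, w)
    rows, rhs = [], []

    def add_equation(i_from, i_to, slope):
        # One equation z[i_to] - z[i_from] = slope.
        r = np.zeros(h * w)
        r[i_to], r[i_from] = 1.0, -1.0
        rows.append(r)
        rhs.append(slope)

    for y in range(h):
        for x in range(w - 1):                     # differences along x
            add_equation(idx[y, x], idx[y, x + 1], 0.5 * (p[y, x] + p[y, x + 1]))
    for y in range(h - 1):
        for x in range(w):                         # differences along y
            add_equation(idx[y, x], idx[y + 1, x], 0.5 * (q[y, x] + q[y + 1, x]))

    z = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0].reshape(h, w)
    return z - z.mean()                            # the constant of integration is arbitrary
```

The dense system grows with the square of the pixel count, so a face-sized image would call for a sparse or iterative solver such as the scheme cited above.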

Recall that a GBR transformation scales the relief
(flattens or extrudes) and introduces an additive
plane. To resolve this GBR ambiguity, we take advantage
of the fact that we are dealing with human
faces which constitute a well known class of objects.
We can therefore exploit the left-to-right symmetry of
faces and the fairly constant ratios of distances between
facial features such as the eyes, the nose, and
the forehead. (In the case when the class of objects is
not well defined, the issue of resolving the GBR ambiguity
becomes more subtle and is essentially an open
problem.) A surface of a face that has undergone a
GBR transformation will have different distance ratios
and can be asymmetric. These differences allow us to
estimate the three parameters of the GBR transformation
which we can then invert. Note that this inverse
transformation is applied to both the estimated
surface z(x,y) and B. Even though this inverse operation
(which is also a GBR transformation) may not
completely resolve the ambiguity of the relief because
of errors in the estimation of the GBR parameters, it
nevertheless comes very close to that effect. After all,
our purpose is not to reconstruct the exact Euclidean
surface of the face, but to create realistic images of a
face under differing pose and illumination. Moreover,
since shadows are preserved under GBR transformations
[4], images synthesized under an arbitrary light
source from a surface whose normal field has been GBR
transformed will have correct shadowing. This means
that the residual GBR transformation (after resolving
the ambiguity) will not affect the image synthesis with
variable illumination.

Figure 2 shows the reconstructed surface of the face 
shown in Figure 1 after resolving the GBR ambiguity.
The first basis image of B* shown in Figure 1.b
has been texture-mapped on the surface. Even though 
we cannot recover the exact Euclidean structure of the 
face (i.e. resolve the ambiguity completely), we can 
still generate synthetic images of a face under variable 
pose where the shape distortions due to the residual 
GBR ambiguity are quite small and not visually de- 
tectable. 

4 Image Synthesis 

We first demonstrate the ability of our method to gen- 
erate images of an object under novel illumination con- 
ditions but fixed pose. Figure 3 shows sample single 
light source images of a face generated with the im- 
age formation model in Equation 1 which has been 
extended to account for cast shadows. To determine 
cast shadows, we employ ray-tracing that uses the
reconstructed surface of the face z(x, y) after resolving
the GBR ambiguity. Specifically, a point on the surface 
is in cast shadow if, for a given light source direction,
a ray emanating from that point parallel to the light 
source direction intersects the surface at some other 
point. With this extended image formation model, 
the generated images exhibit realistic shading and,
despite the small presence of shadows in the images in
Figure 1.a, have strong attached and cast shadows.
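
The cast-shadow test can be sketched as a simple ray march over a height field. This is a minimal illustration under stated assumptions (a regular grid with z[row, col] the height at x = col, y = row, nearest-neighbour sampling, and a hypothetical function name); the ray tracer actually used is not specified beyond the description above.

```python
import numpy as np

def cast_shadow_mask(z, light, step=0.5, eps=1e-6):
    """Return a boolean mask that is True where a surface point is in cast shadow.

    z     : 2D array of heights, z[row, col] at (x = col, y = row)
    light : 3-vector (lx, ly, lz) pointing from the surface toward the light
    """
    light = np.asarray(light, dtype=float)
    light = light / np.linalg.norm(light)
    rows, cols = z.shape
    z_max = z.max()
    n_steps = int(2 * (rows + cols) / step)   # upper bound on the march length
    shadow = np.zeros(z.shape, dtype=bool)
    for r in range(rows):
        for c in range(cols):
            for k in range(1, n_steps + 1):
                t = k * step
                x = c + t * light[0]
                y = r + t * light[1]
                h = z[r, c] + t * light[2]    # height of the ray at (x, y)
                if h > z_max:
                    break                      # ray is above the whole surface
                xi, yi = int(round(x)), int(round(y))
                if not (0 <= xi < cols and 0 <= yi < rows):
                    break                      # ray left the grid without a hit
                if (xi, yi) != (c, r) and z[yi, xi] > h + eps:
                    shadow[r, c] = True        # surface blocks the ray
                    break
    return shadow
```

For example, a wall of height 3 on flat ground, lit from the side at 45 degrees of elevation, shadows the ground cells within three units on its far side.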

Figure 2: The reconstructed surface.

Figure 4 displays a set of synthesized images of the
face viewed under variable pose but with fixed
lighting. The images were created by rigidly rotat- 
ing the reconstructed surface shown in Figure 2 first 
about the horizontal and then about the vertical axis. 
Along the rows from left to right, the azimuth varies 
(in 10 degree intervals) from 30 degrees to the right of 
the face to 10 degrees to the left. Down the columns, 
the elevation varies (again in 10 degree intervals) from 
20 degrees above the horizon to 30 degrees below. For 
example, in the bottom image of the second column
from the left the surface has an azimuth of 20 degrees 
to the right and an elevation of 30 degrees below the 
horizon. The single light source is following the face 
around as it changes pose. This implies that a patch 
on the surface has the same intensity in all poses. It is 
interesting to see that the images look quite realistic 
with maybe the exception of the three right images in 
the bottom row which appear to be a little flattened. 
This is not due to any errors during the geometric or 
photometric modeling but probably due to our visual 
priors; we are not used to looking at a face from above.

In Figure 5, we combine both variations in viewing
conditions to synthesize images of the face under novel 
pose and lighting. We used the same poses as in Fig- 
ure 4 but now the light from the single point source is 
fixed to come along the gaze direction of the face in the 
top-right image. Therefore, as the face moves around 
and its gaze direction changes with respect to the light 
source direction, the shading of the surface changes 
and both attached and cast shadows are formed, as 
one would expect. The synthesized images seem to 
agree with our visual intuition. 




Figure 3: Sample images of the face under novel illu- 
mination conditions but fixed pose. 

5 Discussion 

Appearance variation of an object caused by small 
changes in illumination under fixed pose can provide 
enough information to estimate (under the assumption 
of a Lambertian reflectance function) the object's sur- 
face normal field scaled by its albedo. In the presented 
method, as few as three images with no knowledge of 
the light source directions can be used in the estima- 
tion. The estimated surface normal field can then be 
integrated to reconstruct the object's surface. Unlike 
multi-view based image synthesis, our approach does 
not require the determination of point or line corre- 
spondences to do the surface reconstruction. Since we 
are dealing with a well known class of objects, we can 
acceptably resolve the GBR ambiguity of the reconstructed
surface. Then, the surface together with the
surface normal field scaled by the albedo are sufficient 
for synthesizing images of the object under novel pose 
and lighting. 

The effectiveness of our approach stems from three 
reasons. First, the estimation of the illumination 
model B* does not use any invalid data (such as
shadows) which would otherwise lead to large biases.
Second, the integrability constraint is enforced on the
surface normal field, which significantly improves the
surface reconstruction. Last, unlike classical photometric
stereo, our method requires no knowledge of light
source locations. This obviates the need of error-prone 
calibration of a fixed lighting rig where any errors in 
estimating the position of the light sources can propagate
to the estimation of the illumination model, causing
large inaccuracies. Together, these factors have led to
improved performance, which we have demonstrated
by synthesizing realistic images of human faces. 

References 

[1] S. Avidan, T. Evgeniou, A. Shashua, and T. Pog- 
gio. Image-based view synthesis by combining trilinear 
tensors and learning techniques. In ACM Symposium 
on Virtual Reality Software and Technology, 1997. 

[2] S. Avidan and A. Shashua. Novel view synthesis in
tensor space. In Proc. IEEE Conf. on Comp. Vision
and Patt. Recog., pages 1034-1040, 1997.

[3] E. Barett, M. Brill, N. Haag, and P. Payton. Invariant
linear methods in photogrammetry and model matching.
In J. Mundy and A. Zisserman, editors, Geometric
Invariance in Computer Vision, pages 277-292. MIT
Press, 1992.

[4] P. Belhumeur, D. Kriegman, and A. Yuille. The
bas-relief ambiguity. In Proc. IEEE Conf. on Comp.
Vision and Patt. Recog., pages 1040-1046, 1997.

[5] P. N. Belhumeur and D. J. Kriegman. What is the
set of images of an object under all possible lighting
conditions? In Proc. IEEE Conf. on Comp. Vision
and Patt. Recog., pages 270-277, 1996.

[6] R. Carceroni and K. Kutulakos. Shape and motion
of 3-d curves from multi-view image scenes. In Image
Understanding Workshop, pages 171-176, 1998.

[7] S. Chen and L. Williams. View interpolation for image
synthesis. In Computer Graphics (SIGGRAPH), pages
279-288, 1993.

[8] G. Chou and S. Teller. Multi-image correspondence
using geometric and structural constraints. In Image
Understanding Workshop, pages 869-874, 1997.

[9] P. Debevec, C. Taylor, and J. Malik. Modeling and
rendering architecture from photographs: A hybrid
geometry- and image-based approach. In Computer
Graphics (SIGGRAPH), pages 11-20, 1996.

[10] R. Epstein, A. Yuille, and P. N. Belhumeur. Learning
and recognizing objects using illumination subspaces.
In Proc. of the Int. Workshop on Object Representation
for Computer Vision, 1996.

[11] A. Georghiades, D. Kriegman, and P. Belhumeur.
Illumination cones for recognition under variable
lighting: Faces. In Proc. IEEE Conf. on Comp. Vision and
Patt. Recog., 1998.

[12] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F.
Cohen. The Lumigraph. In Computer Graphics
(SIGGRAPH), pages 43-54, 1996.

[13] H. Hayakawa. Photometric stereo under a light-source
with arbitrary motion. JOSA-A, 11(11):3079-3089,
Nov. 1994.



[14] B. Horn. Computer Vision. MIT Press, Cambridge,
Mass., 1986.

[15] B. Horn and M. Brooks. The variational approach to
shape from shading. Computer Vision, Graphics and
Image Processing, 35:174-208, 1992.

[16] D. Jacobs. Linear fitting with missing data:
Applications to structure from motion and characterizing
intensity images. In Proc. IEEE Conf. on Comp.
Vision and Patt. Recog., 1997.

[17] J. Koenderink and A. Van Doorn. Affine structure
from motion. JOSA-A, 8(2):377-385, 1991.

[18] S. Laveau and O. Faugeras. 3-D scene representation
as a collection of images and fundamental matrices.
Technical Report 2205, INRIA-Sophia Antipolis,
February 1994.

[19] M. Levoy and P. Hanrahan. Light field rendering. In
Computer Graphics (SIGGRAPH), pages 31-42, 1996.

[20] W. R. Mark, L. McMillan, and G. Bishop.
Post-rendering 3d warping. In Computer Graphics
(SIGGRAPH), pages 39-46, 1997.

[21] L. Matthies, R. Szeliski, and T. Kanade. Kalman
filter-based algorithms for estimating depth from
image sequences. Int. J. Computer Vision, 3:293-312,
1989.

[22] R. Rosenholtz and J. Koenderink. Affine structure and
photometry. In Proc. IEEE Conf. on Comp. Vision
and Patt. Recog., pages 790-795, 1996.

[23] S. Seitz and C. Dyer. View morphing. In Computer
Graphics (SIGGRAPH), pages 21-30, 1996.

[24] J. Shade, S. Gortler, L. wei He, and R. Szeliski.
Layered depth maps. In Computer Graphics
(SIGGRAPH), pages 251-258, 1998.

[25] A. Shashua. Geometry and Photometry in 3D Visual
Recognition. PhD thesis, MIT, 1992.

[26] H. Shum, K. Ikeuchi, and R. Reddy. Principal
component analysis with missing data and its application
to polyhedral object modeling. IEEE Trans. Pattern
Anal. Mach. Intelligence, 17(9):854-867, September
1995.

[27] W. Silver. Determining Shape and Reflectance Using
Multiple Images. PhD thesis, MIT, Cambridge, MA,
1980.

[28] C. Tomasi and T. Kanade. Shape and motion from
image streams under orthography: A factorization
method. Int. J. Computer Vision, 9(2):137-154, 1992.

[29] R. Woodham. Analysing images of curved surfaces.
Artificial Intelligence, 17:117-140, 1981.

[30] Y. Yu and J. Malik. Recovering photometric properties
of architectural scenes from photographs. In Computer
Graphics (SIGGRAPH), pages 207-218, 1998.

[31] A. Yuille and D. Snow. Shape and albedo from multiple
images using integrability. In Proc. IEEE Conf. on
Comp. Vision and Patt. Recog., pages 158-164, 1997.







Figure 4: Synthesized images under variable pose but with fixed lighting; the single light source is following the 
face. 

Abstract

We demonstrate that a small number of 2D statistical
models are sufficient to capture the shape and
appearance of a face from any viewpoint (full profile 
to fronto-parallel). Each model is linear and can be
matched rapidly to new images using the Active Ap- 
pearance Model algorithm. We show how such a set of 
models can be used to estimate head pose, to track faces 
through large angles of head rotation and to synthesize 
faces from unseen viewpoints. 



1 Introduction 

The appearance of a face in a 2D image can change 
dramatically as the viewing angle changes. The major- 
ity of work on face tracking and recognition assumes 
near fronto-parallel views, and tends to break down 
when presented with large rotations or profile views. 
Three general approaches have been used to deal with 
this: a) use a full 3D model [15], b) introduce
non-linearities into a 2D model [6] and c) use a set of models
to represent appearance from different view points [11]. 
In this paper we explore the last approach, using statis- 
tical models of shape and appearance to represent the 
variations in appearance from a particular viewpoint. 

These appearance models are trained on example 
images labelled with sets of landmarks to define the 
correspondences between images [1]. Lanitis et al[S\ 
showed that a linear model was sufficient to simulate 
considerable changes in viewpoint, as long as all the 
modelled features (the landmarks) remained visible. 
A model trained on near fronto-parallel face images 
can cope with pose variations of up to 45° either side. 
For much larger angle displacements, some features be- 
come occluded, and the assumptions of the model break 
down. 

We demonstrate that to deal with full 180° rotation
(from left profile to right profile), we need
only 5 models, roughly centred on viewpoints at
-90°, -45°, 0°, 45°, 90° (where 0° corresponds to
fronto-parallel). The pairs of models at ±90° (full profile)
and ±45° (half profile) are simply reflections of each 
other, so there are only 3 distinct models. We can 
use these models for estimating head pose, for track- 
ing faces through wide changes in orientation and for 
synthesizing new views of a subject given a single view. 

Each model is trained on labelled images of a variety 
of people with a range of orientations chosen so none 
of the features for that model become occluded. The 
different models use different sets of features (see Fig- 
ure 1). Each example view can then be approximated 
using the appropriate appearance model with a vector 
of parameters, c. We assume that as the orientation 
changes, the parameters, c, trace out an approximately 
elliptical path. We can learn the relationship between c 
and head orientation, allowing us to both estimate the 
orientation of any head and to be able to synthesize a 
face at any orientation. 

By using the Active Appearance Model algorithm 
[4, 1] we can match any of the individual models to 
a new image rapidly. If we know in advance the ap- 
proximate pose, we can easily select the most suitable 
model. If we do not know, we can search with each of
the five models and choose the one which achieves the 
best match. Once a model is selected and matched, we 
can estimate the head pose, and thus track the face, 
switching to a new model if the head pose varies sig- 
nificantly. 

Given a single image of a new person, we can match 
the models to estimate the pose. We can then use the 
best fitting model to generate new views from angles 
similar to that of the original image. We can also ex- 
ploit correlations across models of different views to 
estimate the appearance of the subject in a completely 
different view. Though this can perhaps be done most 
effectively with a full 3D model [15], we demonstrate 
that good results can be achieved just with a set of 2D 
models. 



In the following we describe the techniques in more 
detail and give examples of the model, its ability to 
estimate pose, to track faces and to synthesize unseen 
views. 

2 Background 

Statistical models of shape and texture have been 
widely used for recognition, tracking and synthesis [7, 
9, 4, 14], but have tended to be used only with near
fronto-parallel images.

Moghaddam and Pentland [11] describe using view-based
eigenface models to represent a wide variety of
viewpoints. Our work is similar to this, but by in- 
cluding shape variation (rather than the rigid eigen- 
patches), we require fewer models and can obtain bet- 
ter reconstructions with fewer model modes. 

Maurer and von der Malsburg [10] demonstrated 
tracking heads through wide angles by tracking graphs 
whose nodes are facial features, located with Gabor 
jets. The system is effective for tracking, but is not able 
to synthesize the appearance of the face being tracked. 

Murase and Nayar [6] showed that the projections 
of multiple views of a rigid object into an eigenspace 
fell on a 2D manifold in that space. By modelling this 
manifold they could recognise objects from arbitrary 
views. A similar approach has been taken by Gong 
et al. [13, 8] who use non-linear representations of the
projections into an eigen-face space for tracking and 
pose estimation, and by Graham and Allinson [5] who 
use it for recognition from unfamiliar viewpoints. 

Romdhani et al. [12] have extended the Active Shape
Model to deal with full 180° rotation of a face using a 
non-linear model. However, the non-linearities mean 
the method is slow to match to a new image. 

Vetter [15] has demonstrated how a 3D statistical 
model of face shape and texture can be used to gener- 
ate new views given a single view. The model can be 
matched to a new image from more or less any view- 
point using a general optimisation scheme, though this 
is slow. By explicitly taking into account the 3D nature 
of the problem, this approach is likely to yield better 
reconstructions than the purely 2D method described 
below. However, the view based models we propose 
could be used to drive the parameters of the 3D head 
model, speeding up matching times. 

3 Statistical Models of Appearance 

An appearance model can represent both the shape 
and texture variability seen in a training set. The train- 
ing set consists of labelled images, where key landmark 



points are marked on each example object. The training
set is usually labelled manually, though automatic
methods are being developed. For instance, Figure 1
shows examples of labelled images used to train the
view-based face models.




[Figure 1 panels: Profile, Half Profile, Frontal]

Figure 1. Examples from the training sets for
the models



Given such a set we can generate statistical models
of shape and texture variation (see [1, 4] for details).
The shape of an object can be represented as
a vector x and the texture (grey-levels or colour values)
represented as a vector g. The appearance model
has parameters, c, controlling the shape and texture
according to

    x = \bar{x} + Q_s c
    g = \bar{g} + Q_g c                    (1)

where \bar{x} is the mean shape, \bar{g} the mean texture and
Q_s, Q_g are matrices describing the modes of variation
derived from the training set.

We trained three distinct models on data similar 
to that shown in Figure 1. The profile model was 
trained on 234 landmarked images taken of 15 individ- 
uals from different orientations. The half-profile model 
was trained on 82 images, and the frontal model on 294 
images. 

An example image can be synthesised for a given c 
by generating a texture image from the vector g and 
warping it using the control points described by x. For 
instance, Figure 2 shows the effects of varying the first
two appearance model parameters, c_1 and c_2, of models
trained on a set of face images, labelled as shown in 
Figure 1. These change both the shape and the texture 
component of the synthesised image. 
[Figure 2 panels: c_1 varies ±2 s.d.; c_2 varies ±2 s.d.]



Figure 2. First two modes of the face models 
(top to bottom: profile, half-profile and frontal) 



4 Predicting Pose 

We assume that the model parameters are related 
to the viewing angle, θ, approximately as

    c = c_0 + c_c \cos(\theta) + c_s \sin(\theta)      (2)

where c_0, c_c and c_s are vectors estimated from training
data (see below).

(Here we consider only rotation about a vertical axis
- head turning. Nodding can be dealt with in a similar 
way.) 

This is an accurate representation of the relationship
between the shape, x, and orientation angle under
an affine projection (the landmarks trace circles in 3D 
which are projected to ellipses in 2D), but our exper- 
iments suggest it is also an acceptable approximation 
for the appearance model parameters, c. 

In order to learn the relationship for a given model, 
we must know the orientation of each of our train- 
ing examples. We do not yet have access to a system 
which can measure it accurately, such as that used by 
[12, 8, 13]. However, we are able to estimate the angle
by finding the frames in our training sequences at full 
profile and fronto-parallel by eye, then assuming a con- 
stant rate of rotation across the frames between. This 
leads to images labelled with orientations, accurate
to about ±10°. For each such image we find the best
fitting model parameters, c_i. We then perform regression
between {c_i} and the vectors {(1, cos(θ_i), sin(θ_i))^T}
to learn c_0, c_c and c_s.
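
The regression step can be sketched with ordinary least squares. This is a minimal illustration rather than the authors' implementation; the function name and the array layout (one parameter vector per row, angles in radians) are assumptions.

```python
import numpy as np

def fit_rotation_model(C, theta):
    """Least-squares fit of c = c_0 + c_c cos(theta) + c_s sin(theta).

    C     : (n_examples, n_params) array, row i holds the parameters c_i
    theta : (n_examples,) array of orientation angles in radians
    Returns the vectors (c_0, c_c, c_s), each of length n_params.
    """
    theta = np.asarray(theta, dtype=float)
    # Each row of the design matrix is (1, cos(theta_i), sin(theta_i)).
    design = np.column_stack([np.ones_like(theta), np.cos(theta), np.sin(theta)])
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(C, dtype=float), rcond=None)
    return coeffs[0], coeffs[1], coeffs[2]
```

At least three training examples at distinct angles are needed for the fit to be determined; with angle labels accurate only to about ±10°, many more would be expected in practice.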

Figure 3 shows reconstructions in which the orientation,
θ, is varied in Equation 2.

Given a new example with parameters c, we can 
estimate its orientation as follows. Let R_c^{-1} be
the left pseudo-inverse of the matrix (c_c | c_s) (thus
R_c^{-1} (c_c | c_s) = I_2).




[Figure 3 panel labels, by row: -105°, -80°, -60°; -60°, -40°, -20°; -45°, 0°, +45°]

Figure 3. Rotation modes of three face models



Let

    (x_a, y_a)^T = R_c^{-1} (c - c_0)      (3)

then the best estimate of the orientation is
\tan^{-1}(y_a / x_a).
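
A minimal sketch of this estimate follows, assuming the three model vectors are already known. The function name is hypothetical, and `arctan2` is used in place of a plain arctangent so that the quadrant is resolved; that is a choice made here, not something stated in the text.

```python
import numpy as np

def estimate_angle(c, c0, cc, cs):
    """Estimate the orientation angle (radians) of parameters c via Equation 3."""
    M = np.column_stack([cc, cs])           # the matrix (c_c | c_s), shape (n_params, 2)
    R_inv = np.linalg.pinv(M)               # left pseudo-inverse: R_inv @ M = I_2
    xa, ya = R_inv @ (np.asarray(c, dtype=float) - c0)
    # (xa, ya) is approximately (cos(theta), sin(theta)).
    return np.arctan2(ya, xa)
```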

Figure 4 shows the predicted orientations vs the ac- 
tual orientations for the training sets for each of the 
models. It demonstrates that equation 2 is an accept- 
able model of parameter variation under rotation. 

5 Tracking through wide angles 

We can use the set of models to track faces through 
wide angle changes (full left profile to full right profile). 
We use a simple scheme in which we keep an estimate of 
the current head orientation and use it to choose which 
model should be used to match to the next image. 

To track a face through a sequence we locate it in the 
first frame using a global search scheme similar to that 
described in [3]. This involves placing a model instance 
centred on each point on a grid across the image, then 
running a few iterations of the AAM algorithm. Poor
fits are discarded and good ones retained for more it- 
erations. This is repeated for each model, and the best 
fitting model is used to estimate the position and ori- 
entation of the head. 



[Figure 4: plot of predicted angle against actual angle (deg.), both axes spanning -150 to 150, for the training set]



Model                 Angle Range
Left Profile          -110° to -60°
Left Half-Profile     -60° to -40°
Frontal               -40° to 40°
Right Half-Profile    40° to 60°
Right Profile         60° to 110°

Table 1. Valid angle ranges for each model

We then project the current best model instance into 
the next frame and run a multi-resolution search with
the AAM. We estimate the head orientation from the
results of the search, as described above. We then use 
the orientation to choose the most appropriate model 
with which to continue. Each model is valid over a par- 
ticular range of angles, determined from its training set 
(see Table 1). If the orientation suggests changing to
a new model, we estimate the parameters of the new 
model from those of the current best fit. We then perform
an AAM search to match the new model more accurately.
This process is repeated for each subsequent
frame, switching to new models as the angle estimate 
dictates. 
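
The model-switching rule can be sketched as a lookup over the ranges in Table 1. The function name and the tie-breaking rule (keep the current model while the angle is still inside its range) are assumptions made for illustration; the text does not say how shared boundaries are handled.

```python
# Valid angle ranges in degrees, copied from Table 1.
ANGLE_RANGES = {
    "Left Profile": (-110.0, -60.0),
    "Left Half-Profile": (-60.0, -40.0),
    "Frontal": (-40.0, 40.0),
    "Right Half-Profile": (40.0, 60.0),
    "Right Profile": (60.0, 110.0),
}

def choose_model(angle_deg, current=None):
    """Pick the model whose valid range contains the estimated angle.

    If the current model is still valid (for example exactly on a shared
    boundary), keep it to avoid needless switching. Returns None when the
    angle lies outside every range.
    """
    if current is not None:
        low, high = ANGLE_RANGES[current]
        if low <= angle_deg <= high:
            return current
    for name, (low, high) in ANGLE_RANGES.items():
        if low <= angle_deg <= high:
            return name
    return None
```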

When switching to a new model we must estimate 
the image pose (position, within image orientation and 
scale) and model parameters of the new example from 
those of the old. We assume linear relationships which 
can be determined from the training sets for each 
model, as long as there are some images (with interme- 
diate head orientations) which belong to the training 
sets for both models. 

Figure 7 shows the results of using the models to 
track the face in a new test sequence (in this case a 
previously unseen sequence of a person who is in the 
training set). The model reconstruction is shown
superimposed on frames from the sequence. The method
appears to track well, and is able to reconstruct a
convincing simulation of the sequence.

We used this system to track 15 new sequences of 
the people in the training set. Each sequence contained 
between 20 and 30 frames. Figure 5 shows the estimate 
of the angle from tracking against the actual angle. In 
all but one case the tracking succeeded, and a good es- 
timate of the angle is obtained. In one case the models 
lost track and were unable to recover. 

The system currently works off-line, loading se- 
quences from disk. On a 450MHz Pentium III it runs 
at about 3 frames per second, though so far little work 
has been done to optimise this. 



[Figure 5: plot of the angle estimated by tracking against Actual Angle (deg.)]

Figure 5. Comparison of angle derived from
AAM tracking with actual angle (15 sequences)



6 Predicting Unseen Views 

Given a single view of a new person, we can find 
the best model match and determine their head orien- 
tation. We can then use the best model to synthesize 
new views at any orientation that can be represented 
by the model. If the best matching parameters are c, 
we use equation 3 to estimate the angle, θ. Let c_res
be the residual vector not explained by the rotation
model, i.e.

    c_{res} = c - (c_0 + c_c \cos(\theta) + c_s \sin(\theta))      (4)

To reconstruct at a new angle, α, we simply use the
parameters

    c(\alpha) = c_0 + c_c \cos(\alpha) + c_s \sin(\alpha) + c_{res}      (5)

This only allows us to vary the angle in the range 
defined by the closest model. Since the models all rep- 
resent the same 3D structure, we anticipate that there 
will be correlations between parameters for different 
views of the same individual. To do this effectively we 
must first project out the effects of pose, lighting etc. 
A principled approach to this is described in [2]. How- 
ever, for our experiments, since there is little lighting 
or expression change in the training set, it is sufficient 
just to remove the orientation components. 
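
Equations 3 to 5 can be combined into one short routine for re-synthesising the parameters at a new angle within a single model. This is a sketch under the assumption that c_0, c_c and c_s have already been learned; the function name is hypothetical.

```python
import numpy as np

def rotate_parameters(c, alpha, c0, cc, cs):
    """Re-express parameters c at a new angle alpha (radians) using Eqs. 3-5."""
    c = np.asarray(c, dtype=float)
    # Equation 3: estimate the current angle from the pseudo-inverse of (c_c | c_s).
    xa, ya = np.linalg.pinv(np.column_stack([cc, cs])) @ (c - c0)
    theta = np.arctan2(ya, xa)
    # Equation 4: residual not explained by the rotation model.
    c_res = c - (c0 + cc * np.cos(theta) + cs * np.sin(theta))
    # Equation 5: rotation model at the new angle plus the same residual.
    return c0 + cc * np.cos(alpha) + cs * np.sin(alpha) + c_res
```

Setting alpha to the estimated angle returns the input unchanged, which gives a quick sanity check on the residual bookkeeping.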

In order to learn the relationship between param- 
eters in one model and those in another, we perform 
the following steps. For each frame in the training set 
we use equation 4 to determine the orientation inde- 
pendent component of the parameters for each model. 
We then compute the mean of such residuals for each 
person. Let \bar{c}_{ij} be the mean of such residuals in the
j-th model for the i-th person. By applying PCA to the
means for a given model, we can find the projection,
P_j, into an 'identity' sub-space.

Let the projection of each mean in the subspace be

    b_{ij} = P_j (\bar{c}_{ij} - \bar{c}_j)      (6)

where \bar{c}_j is the mean of the means.

We can use linear regression to learn the relationship
which maps each b_{ij} in the identity space of the j-th
model to the corresponding b_{ik} in the identity
space of the k-th model:

    b_{ij} = r_{jk} + R_{jk} b_{ik}      (7)

Thus to reconstruct a new view of a person given a 
match in a different view:

1. remove the effects of orientation (Eq.4), 

2. project into the identity sub-space for the model 
(Eq.6), 

3. project across into the subspace of the target model 
(Eq.7), 

4. project that into the residual space (inverting Eq.6) 

5. add the appropriate orientation (Eq. 5). 
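
The five steps above can be sketched as follows. This is an illustrative outline, not the authors' code: the function names, the use of an SVD for the PCA, and the tuple layout of the arguments are assumptions. Step 1 is taken as already done, so the input is the source model's orientation-free residual, and the regression is fitted from the source model's identity coordinates to the target's, which is how step 3 uses it.

```python
import numpy as np

def identity_subspace(mean_residuals, n_modes):
    """PCA on per-person mean residuals (rows = people) for one model.

    Returns (P, centre) so that b = P @ (c_bar - centre), as in Eq. 6.
    """
    centre = mean_residuals.mean(axis=0)
    _, _, vt = np.linalg.svd(mean_residuals - centre, full_matrices=False)
    return vt[:n_modes], centre

def fit_cross_model_map(b_source, b_target):
    """Least-squares fit of b_target = r + R @ b_source (form of Eq. 7)."""
    design = np.hstack([np.ones((b_source.shape[0], 1)), b_source])
    coeffs, *_ = np.linalg.lstsq(design, b_target, rcond=None)
    return coeffs[0], coeffs[1:].T            # r, R

def predict_view(c_res_source, src, tgt, r, R, rot_tgt, alpha):
    """Predict target-model parameters at angle alpha (radians)."""
    P_s, centre_s = src
    P_t, centre_t = tgt
    b_s = P_s @ (c_res_source - centre_s)     # step 2: project into identity space
    b_t = r + R @ b_s                         # step 3: map across models
    c_res_t = centre_t + P_t.T @ b_t          # step 4: back to residual space
    c0, cc, cs = rot_tgt                      # rotation model of the target
    return c0 + cc * np.cos(alpha) + cs * np.sin(alpha) + c_res_t   # step 5
```

With 14 training people the identity space has at most 13 dimensions, so `n_modes` cannot usefully exceed that in the experiment described below.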

Figure 6 demonstrates this. Models were built on 
the data for all but one person. The profile model was 
then matched to a profile image of the missing person 
(the reconstruction is shown). The method described 
above is then used to predict the appearance using the 
frontal model at two different angles. For comparison, 
corresponding images of the person at similar angles 
are shown. Given the small nature of the training set 
(in this case only 14 people, yielding a 13-D identity 
space), the results are encouraging. 




[Figure 6 panels: Best Fit, New View, New View]

Figure 6. The best fit with a profile model is
projected to the frontal model to predict new
views



7 Discussion and Conclusions 

We have demonstrated that a small number of view- 
based statistical models of appearance can represent 
the face from a wide range of viewing angles. Although 
we have concentrated on rotation about a vertical axis, 
rotation about a horizontal axis (nodding) could easily 
be included (and probably wouldn't require any extra 
models for modest rotations). We have shown that 
the models can be used to track faces through wide 
angle changes, and that they can be used to predict 
appearance from new viewpoints given a single image 
of a person. 

So far we have only tested the methods on a rela- 
tively small and clean data set. We intend to gather 
more data in order to obtain better generalisation abil- 
ity, to include expression and lighting changes and to 
investigate its performance on more cluttered back- 
grounds. We hope to obtain better calibrated train- 
ing images in order to obtain more accurate angle es- 
timates. 

We anticipate the approach will be useful in many 
applications, including driving animated avatars, cal- 
culating head pose and making face recognition sys- 
tems more invariant to viewing angle. 
Figure 7. Reconstruction of tracked faces su-

Abstract

In this paper, we propose a new system for estimating
face pose from a facial image. In this system, the
input facial image is compared with a database of images
of various face poses, and the matched image provides
the face pose. The database includes not only
various face poses but also various illumination conditions,
so that the system can be used under varying
illumination. To collect such a variety of facial images,
they are generated by computer rather than captured
as real images. The eigenspace method is used to search
for the image that matches the input facial image.
Because images under various illumination conditions are
collected in the database, the extracted principal
eigenvectors depend mostly on the face pose. By
performing the matching in the eigenspace, the database
image that matches the input facial image can be found;
its pose is the closest to that of the input face. The
matching is also fast because it is performed in a
low-dimensional space spanned by only the selected
eigenvectors. The proposed system can continuously track
the face pose of different people under different lighting
conditions.

1 Introduction 

Human-computer interfaces have recently been studied
intensively with the aim of making computers usable by
everyone. Recent computers have not only displays but
also cameras that capture images of their users. Such
cameras can serve as an input device for the user's
behavior, so that a more natural interface can be
realized.

For recognizing a user's behavior from images,
automatic face pose estimation is one application of
computer vision and image understanding research.
Many methods exist for estimating the pose of a face
from images. They can be categorized into: 1) methods
based on detecting facial features such as the eyes, nose
and mouth [1][2], and 2) methods based on the intensity
distribution of images [5][6]. The former generally involve
estimating the 3D positions of the features, so accurate
camera parameters must be known. Additionally, feature
detection by image understanding techniques is still a
hard problem under arbitrary conditions, such as varying
illumination and background scenes. The latter are
basically model-based methods, in which previously
obtained image models are used for face pose estimation.
These methods do not need feature detection, but
collecting the models before estimating the face pose
takes considerable labor.

In this paper, we propose a new model-based face pose
estimation system. In this system, computer-generated
facial images are used to reduce the cost of collecting
images of various models. With a computer-generated
model, it is very easy to change the conditions of the
facial images, such as the illumination. The database of
facial images includes not only various face poses but
also various illumination conditions, so that the system
can be used under varying illumination. The parametric
eigenspace method [3] is used to extract principal
vectors of the facial images that depend mostly on the
face pose. By performing the matching in the space
spanned by the principal vectors, the input facial image
can be matched with the database image whose pose is
closest to that of the input face.

We also show the results of pose estimation from facial
images under various conditions to demonstrate
the efficacy of the proposed system.

2 Face Pose Estimation Based on 
Parametric Eigen Vector Method 

Murase et al. proposed the parametric eigenspace
technique [3], in which important feature vectors
(principal vectors) of the database are extracted by
eigenspace analysis. They apply this method to object
recognition from appearance, where each model
identity is represented in the eigenspace spanned
by the principal vectors. The storage required for
the database is reduced because only the extracted
low-dimensional feature vectors are needed for the
matching procedure. This reduction also lowers the
computational cost of matching input features against
the features in the database. The KL transform is
employed to extract several important principal
vectors of the data set in the database. Consequently,
the objects can be represented with a small number of
dimensions (e.g. 20) rather than the original number
(e.g. 256^2).

In our face pose estimation system, the parametric
eigen space method is used to extract principal vectors
that depend mostly on the face pose.

Figure 1: The flowchart of the learning stage.



The intensity distribution of an image depends mostly
on the face pose, the personality of the face, and the
illumination condition. Here, personality means personal
features such as the positions of the mouth, eyes, and
nose, the shape of the face, etc. However, such
personality is difficult to represent with a small
number of principal vectors because the personal
features are complicated. Thus, such personal features
are unlikely to be extracted as principal feature
vectors by the eigen space technique.

The illumination condition strongly affects the
intensity distribution of the image. However, if the set
of facial images includes images taken under a variety
of illumination conditions, the contribution of the
illumination condition to the principal feature vectors
can be reduced.

In this way, the face pose is estimated by constructing
the eigen space spanned by the principal vectors
extracted from facial image data taken under various
illumination conditions.



3 Face Pose Estimating System

The proposed pose estimation system is divided into
two stages: a learning stage and an estimation stage. In
the learning stage, the principal vectors are extracted
from the facial image database. In the estimation
stage, the face pose of the input image is estimated
in the eigen space that is spanned by the principal
vectors.

Figures 1 and 2 show the flow of each stage.

Figure 2: The flowchart of the estimation stage.



3.1 Learning stage 

3.1.1 Collection of facial images 

Facial images under various illumination conditions and
various face poses are generated by computer graphics
(CG). The use of computer generated facial images for
the database has two advantages:

• Collection of various facial images is easy. 

• Control of face pose and environmental condition 
is easy. 

Collecting various facial images from real humans
requires much labor. We also need to know the angle of
the face pose for each facial image in the database used
for the learning. The angle of the face pose in a CG
image is easy to obtain, while the angle in a real
facial image is difficult to measure. In addition, we
need to collect images of various face poses under
various environmental conditions, such as illumination.
It is easy to change such conditions for computer
generated facial images. For these reasons, we generate
the facial images by computer graphics rather than
taking real images.

Although using various face shapes could reduce the
dependency of the principal feature vectors on face
shape, we use only one face shape model in this paper.
This is because we assume that differences in face
shape between persons have a much smaller effect on the
intensity distribution than differences in face pose.

3.1.2 Extraction of eigen space

A facial image x with n x n pixels is defined as the
vector

$$x = [z_1, z_2, \ldots, z_{n^2}]^T. \tag{1}$$

The set of N facial images is denoted as

$$[x_1, x_2, \ldots, x_N]. \tag{2}$$

The average image c of the set is subtracted from
every image, and the matrix X with $n^2$ rows and N
columns is represented as

$$X = [x_1 - c, x_2 - c, \ldots, x_N - c]. \tag{3}$$

The covariance matrix Q of X is derived as

$$Q = X X^T. \tag{4}$$

Then the eigen equation can be written as

$$\lambda_i e_i = Q e_i. \tag{5}$$

With this equation, the eigen values $\lambda_i$ and
eigen vectors $e_i$ can be calculated, so that an eigen
space with dimension k ($k \ll n^2$) can be constructed.
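The construction in Eqs. (1)-(5) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `build_eigenspace`, the synthetic random images, and the use of a singular value decomposition (whose left singular vectors are the eigen vectors of $Q = XX^T$, with eigen values equal to the squared singular values) are assumptions made here to avoid forming the $n^2 \times n^2$ matrix Q explicitly.

```python
import numpy as np

def build_eigenspace(images, k):
    """Return the average image c, the basis E (n^2 x k) and the k largest eigen values.

    images: array of shape (N, n, n). The function name and interface are
    hypothetical; the paper does not specify an implementation.
    """
    N = images.shape[0]
    data = images.reshape(N, -1).astype(np.float64)  # each row is one image vector x_i
    c = data.mean(axis=0)                            # average image c
    X = (data - c).T                                 # n^2 x N matrix of Eq. (3)
    # Left singular vectors of X are eigen vectors of Q = X X^T (Eq. (4));
    # the squared singular values are the corresponding eigen values (Eq. (5)).
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return c, U[:, :k], (s ** 2)[:k]

# Synthetic stand-in data: 12 random 8 x 8 "images" (not real faces).
rng = np.random.default_rng(0)
images = rng.random((12, 8, 8))
c, E, lam = build_eigenspace(images, k=5)

# Projections of all database images onto the eigen space (one row per image).
G = (images.reshape(12, -1) - c) @ E
print(E.shape, G.shape)
```

Using the decomposition of X rather than of Q is a common design choice when N is much smaller than $n^2$, as is the case for a few hundred images of $128 \times 128$ pixels.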

After the construction of the eigen space, the images
are projected onto the eigen space to make the database
of the projected vectors of all the facial images.
Because the dimension k of the eigen space is much
smaller than the dimension $n^2$ of the image, the
storage required for the database can be reduced by the
factor of $k/n^2$.
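As a worked example of this factor, assume that the database images are stored at the same $128 \times 128$ size as the normalized input image and that k = 20, the values reported in Section 4. These two choices are taken from the experiments; applying them to the storage estimate is an assumption made here for illustration.

```latex
\frac{k}{n^2} = \frac{20}{128^2} = \frac{20}{16384} \approx 1.2 \times 10^{-3},
\qquad
174 \times 20 = 3480 \ \text{coefficients versus} \
174 \times 16384 = 2850816 \ \text{pixel values}.
```

Note that the k basis vectors and the average image ($k n^2 + n^2$ values) must also be kept in order to project new input images, so the saving applies to the per-image entries of the database.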

3.2 Pose Estimation Stage 

3.2.1 Extraction and normalization of facial 
image 

To estimate the face pose, the face area must be
extracted from the input image. The face area is defined
as the region from brow to chin and from left ear to
right ear.

To extract the face area from the input image, the color
information of the input image is used. First, the
face candidate regions are extracted by thresholding the
hue and saturation of the input image. The threshold
values for hue and saturation are defined according to
the distribution of face color. The extracted regions,
which are represented as a binary mask image, include
not only the face region but also some regions of
non-facial objects. To remove the non-facial regions,
the size and shape of every region are calculated after
labeling the regions. The face area is selected based on
the size and shape of the region.
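The extraction steps above (hue and saturation thresholding, labeling, and selection by size and shape) can be sketched as follows. The threshold ranges, the minimum area, the aspect-ratio test, and all function names are hypothetical choices made for this illustration; the paper does not give the actual values or the shape criterion.

```python
from collections import deque
import colorsys
import numpy as np

def skin_mask(rgb, hue_range=(0.0, 0.14), sat_range=(0.2, 0.7)):
    """Binary mask of pixels whose hue and saturation lie in the given ranges.

    rgb: float array (H, W, 3) with values in [0, 1]. The ranges are
    illustrative guesses, not the thresholds used in the paper.
    """
    h, w, _ = rgb.shape
    mask = np.zeros((h, w), dtype=bool)
    for r in range(h):
        for c in range(w):
            hue, sat, _ = colorsys.rgb_to_hsv(*rgb[r, c])
            mask[r, c] = (hue_range[0] <= hue <= hue_range[1]
                          and sat_range[0] <= sat <= sat_range[1])
    return mask

def label_regions(mask):
    """4-connected component labeling by breadth-first flood fill."""
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for r, c in zip(*np.nonzero(mask)):
        if labels[r, c]:
            continue
        count += 1
        labels[r, c] = count
        queue = deque([(r, c)])
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = count
                    queue.append((ny, nx))
    return labels, count

def select_face_region(labels, count, min_area=20, aspect_range=(0.6, 1.8)):
    """Bounding box (top, left, bottom, right) of the largest plausible region."""
    best, best_area = None, 0
    for lab in range(1, count + 1):
        ys, xs = np.nonzero(labels == lab)
        height = int(ys.max() - ys.min() + 1)
        width = int(xs.max() - xs.min() + 1)
        plausible = aspect_range[0] <= height / width <= aspect_range[1]
        if ys.size >= min_area and plausible and ys.size > best_area:
            best_area = ys.size
            best = (int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1)
    return best

# Synthetic test image: blue background, one large and one tiny skin-colored patch.
image = np.zeros((20, 20, 3))
image[..., 2] = 1.0
image[4:14, 6:14] = (0.9, 0.6, 0.5)
image[0:2, 0:2] = (0.9, 0.6, 0.5)

mask = skin_mask(image)
labels, count = label_regions(mask)
box = select_face_region(labels, count)
print(count, box)
```

In this toy image the tiny patch is rejected by the area test, and the large patch is returned as the face candidate.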

3.2.2 Pose Estimation 

The normalized input image y is projected onto the
eigen space obtained in the previous section:

$$z = [e_1, e_2, \ldots, e_k]^T (y - c). \tag{6}$$

To estimate the pose of the input image, the Euclidean
distance $d_i = \|z - g_i\|$ between the projected
vector z of the input image and the projected vector
$g_i$ of each image in the database is calculated.

Figure 3: Euclidean distance between input image and
database images in the eigen space.



The pose of the database image with the minimum distance
$d_i$ is the initial guess of the face pose.

The cost of this matching calculation is low because
the distance is calculated in the eigen space of
dimension k.

Figure 3 shows an example of the Euclidean distance
between an input image and the images of the database
in the eigen space. The face angle at the smallest
distance is the initial guess of the face pose.
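The projection of Eq. (6) followed by the minimum-distance search can be sketched as below. The database here is a synthetic stand-in (a one-dimensional intensity bump whose position shifts with the pose angle), and the names `project` and `initial_pose_guess` are hypothetical; only the projection and the nearest-neighbor rule are taken from the text.

```python
import numpy as np

def project(y, c, E):
    """Eq. (6): coordinates z of the image vector y in the eigen space with basis E."""
    return E.T @ (y - c)

def initial_pose_guess(z, G, angles):
    """Return the pose angle of the closest database entry and all distances d_i.

    G: (N, k) projected database vectors g_i; angles: (N,) pose angle per entry.
    """
    d = np.linalg.norm(G - z, axis=1)   # Euclidean distances d_i
    return float(angles[int(np.argmin(d))]), d

# Synthetic stand-in database: one 64-pixel "image" per pose angle.
angles = np.arange(-70.0, 75.0, 5.0)    # -70, -65, ..., 70 degrees
u = np.linspace(0.0, 1.0, 64)
data = np.exp(-((u[None, :] - (0.5 + angles[:, None] / 200.0)) ** 2) / 0.02)

c = data.mean(axis=0)
U, _, _ = np.linalg.svd((data - c).T, full_matrices=False)
E = U[:, :10]                           # k = 10 basis vectors
G = (data - c) @ E                      # projected database

# Query: the 20-degree image with a little additive noise.
rng = np.random.default_rng(0)
y = data[angles == 20.0][0] + rng.normal(0.0, 0.01, size=u.size)

z = project(y, c, E)
estimate, d = initial_pose_guess(z, G, angles)
print(estimate)
```

Each distance is computed over k values instead of $n^2$ pixels, which is where the reduction in matching cost comes from.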

The initial guess of the face pose is compared with
the face pose estimated at the previous image frame.
If the pose difference is larger than a predetermined
threshold value, a local minimum is searched for around
the face pose angle of the previous image frame, and
the angle of the local minimum is the final estimate
of the face pose. Such correction based on the temporal
continuity of the face pose avoids occasional errors
in face pose estimation. The flow of this procedure is
shown in figure 4.
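The correction step can be sketched as follows. The threshold `max_jump`, the search `window`, and the function name are hypothetical, and taking the smallest distance inside a window around the previous angle is used here as a simple stand-in for the local minimum search; the paper does not state the actual values or search procedure. The sketch also assumes one distance value per pose angle.

```python
import numpy as np

def correct_with_previous(angles, d, prev_angle, max_jump=15.0, window=10.0):
    """Return the final pose estimate for the current frame.

    angles: (M,) database pose angles; d: (M,) distance for each angle;
    prev_angle: estimate at the previous frame, or None for the first frame.
    max_jump and window (in degrees) are illustrative assumptions.
    """
    guess = float(angles[int(np.argmin(d))])        # initial guess
    if prev_angle is None or abs(guess - prev_angle) <= max_jump:
        return guess
    # Pose jumped too far: search only near the previous frame's angle.
    idx = np.flatnonzero(np.abs(angles - prev_angle) <= window)
    return float(angles[idx[np.argmin(d[idx])]])

# Toy distance curve: true minimum at 10 degrees, spurious global minimum at -60.
angles = np.arange(-70.0, 75.0, 5.0)
d = np.abs(angles - 10.0) / 10.0 + 1.0
d[angles == -60.0] = 0.5

first = correct_with_previous(angles, d, prev_angle=None)
corrected = correct_with_previous(angles, d, prev_angle=5.0)
print(first, corrected)
```

Without a previous frame the spurious minimum is accepted; with the previous estimate at 5 degrees, the search is restricted to nearby angles and the estimate moves to the local minimum.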

4 Experiments 

4.1 Experimental Conditions 

In this experiment, we prepare a database of images
of 29 different face angles ($-70^\circ \le \theta \le
70^\circ$, at $5^\circ$ intervals) under 6 different
illumination conditions, giving 174 facial images in
total. Figure 5 shows examples of the facial images.
The illumination condition is changed by combining a
point light source and an ambient light source at
different positions, as shown in figure 6.
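The size of the database follows directly from the sampling described above, assuming that both end points of the angle range are included (which is what yields 29 angles). The index layout below is a hypothetical illustration, not the authors' data structure.

```python
# Pose angles from -70 to 70 degrees in 5-degree steps, end points included.
angles = list(range(-70, 75, 5))
illuminations = range(6)

# One database entry per (angle, illumination) pair.
database_index = [(a, l) for a in angles for l in illuminations]

print(len(angles), len(database_index))
```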

Since images under various illumination conditions are
included in the set of facial images, the extracted
principal eigen vectors depend mostly on the face pose.
Therefore, the proposed system can be used under
varying illumination conditions.

Figure 5: Examples of facial images in the database.

To determine the proper dimension of the eigen space,
we performed a test experiment in which we investigated
the relationship between the estimated pose angle and
the dimension of the eigen space. Figure 7 shows the
result of the test experiment. This result shows that
the estimate does not change when the dimension is
larger than 11. From this result, we conclude that a
dimension of 20 is sufficient for estimating the face
pose in this experiment. Figure 8 shows the basis
images up to the 4th dimension of this eigen space.
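A test of this kind can be sketched by repeating the nearest-neighbor estimate for increasing eigen space dimensions and noting from which dimension the estimate stops changing. The data below are a synthetic stand-in (an intensity bump that shifts with pose angle), so the resulting numbers say nothing about the dimension of 11 observed in the actual experiment; all variable names are hypothetical.

```python
import numpy as np

# Synthetic stand-in database: one 64-pixel "image" per pose angle.
angles = np.arange(-70.0, 75.0, 5.0)
u = np.linspace(0.0, 1.0, 64)
data = np.exp(-((u[None, :] - (0.5 + angles[:, None] / 200.0)) ** 2) / 0.02)
c = data.mean(axis=0)
U, s, _ = np.linalg.svd((data - c).T, full_matrices=False)

# Query image: the 20-degree image with a little additive noise.
rng = np.random.default_rng(1)
query = data[angles == 20.0][0] + rng.normal(0.0, 0.01, size=u.size)

# Estimate the pose with the first k basis vectors, for k = 1, ..., 20.
estimates = {}
for k in range(1, 21):
    E = U[:, :k]
    G = (data - c) @ E
    z = E.T @ (query - c)
    estimates[k] = float(angles[np.argmin(np.linalg.norm(G - z, axis=1))])

# Smallest dimension from which the estimate no longer changes.
final = estimates[20]
stable_from = min(k for k in estimates
                  if all(estimates[j] == final for j in range(k, 21)))
print(final, stable_from)
```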

4.2 Pose Estimation Results 

Figure 9 shows examples of the system results. The
top left image is the input image, and the top right
image is the mask image extracted by thresholding hue
and saturation. Although the face pose estimation is
performed on gray images, the color information of
the input image is used for automatic extraction of the
face region. The bottom left image is the input to the
face pose estimation process, which is normalized to a
size of 128 x 128 pixels. The bottom right image is the
matched image in the database. Other examples of pose
estimation by the proposed system are shown in figure
10.
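The normalization of the extracted face area to a 128 x 128 gray image can be sketched as below. The paper does not say how the gray conversion or the resizing is done, so the luma weights and the nearest-neighbor resampling used here, as well as the function name, are assumptions made for illustration only.

```python
import numpy as np

def normalize_face(rgb, box, size=128):
    """Crop the face box, convert to gray and resize to size x size pixels.

    rgb: float array (H, W, 3); box: (top, left, bottom, right).
    Gray conversion weights and nearest-neighbor resizing are assumptions.
    """
    top, left, bottom, right = box
    crop = rgb[top:bottom, left:right].astype(np.float64)
    gray = crop @ np.array([0.299, 0.587, 0.114])      # weighted sum of R, G, B
    # Nearest-neighbor sampling: map each output row/column to a source index.
    rows = np.arange(size) * gray.shape[0] // size
    cols = np.arange(size) * gray.shape[1] // size
    return gray[np.ix_(rows, cols)]

# Toy input: a uniformly white 40 x 30 color image with a 20 x 16 face box.
image = np.ones((40, 30, 3))
face = normalize_face(image, box=(5, 4, 25, 20))
print(face.shape)
```

The resulting array can be flattened into the vector y that is projected onto the eigen space.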

The face pose estimation is performed continuously on
the input image sequence. The estimate is sometimes
wrong, but such wrong estimates can be corrected by
checking the difference from the previous estimate.
With this correction of wrong estimates, the face pose
can be tracked with reasonable quality.

Figure 6: Position of the light source for generating.
(Figure labels: light sources and direction of face.)

Figure 7: Relationship between the dimension of the
eigen space and estimated face pose. (Axes: dimensions
of eigen space, estimated angle.)

Figures 9 and 10 include examples of a face with
glasses. These cases demonstrate that such detailed
features do not affect the pose estimation, because the
eigen space method extracts only the features that
depend on the face pose. These examples also show the
robustness of the face pose estimation in our system.

This system is implemented on an SGI O2 (R5000,
180 MHz) with an O2 Cam. It takes about 2 seconds to
estimate the face pose for 1 frame.

5 Conclusion 

We have proposed a method for estimating face pose
using the eigen space method.

In this method, the computation cost and storage
capacity are much smaller than for correlation matching
in the image space, because the matching is performed
in an eigen space of small dimension. Furthermore,
since the eigen space depends mostly on the features
related to face pose differences, detailed differences
in the input images, e.g. the presence of glasses, do
not affect the pose estimation.

In this method, 3D geometrical information is not
required to estimate the face pose, because pose
estimation is performed according to the appearance of
the face in the 2D image.

Computer graphics is employed to collect the facial
images for learning in this system. The use of computer
graphics greatly reduces the effort needed to collect
facial images under various conditions. These images
are not real images, but they are sufficient for
estimating the face pose.

Figure 9: Input image (top left), extracted face area
(top right), normalized facial image (bottom left) and
estimated pose (bottom right). The rectangle
represents the area extracted as the face region.




Figure 10: Results of pose estimation.